I.  INTRODUCTION


This report compares the arithmetic requirements of several efficient
algorithms which compute the Discrete Fourier Transform (DFT). The DFT is a
powerful, reversible mapping transform for discrete data sequences with
mathematical properties analogous to those of the Fourier transform. The
definitions of the DFT and the inverse DFT can be written in the form

$$A(k) = \sum_{n=0}^{N-1} x(n)\exp(-j2\pi nk/N) \tag{1}$$

$$x(n) = \sum_{k=0}^{N-1} A(k)\exp(j2\pi nk/N) \tag{2}$$


for $k = 0, 1, \ldots, N-1$ and $n = 0, 1, \ldots, N-1$. The N-point data
sequences x(n) and A(k) are generally complex and are often used to represent
time and frequency series, respectively.
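
As a point of reference for the fast algorithms discussed later, the definitions above can be evaluated directly. The following is a minimal sketch, not taken from the report: the function names `dft` and `idft` are illustrative, and the 1/N scaling applied in `idft` is an assumption added here, since Equations (1) and (2) as printed carry no normalization factor and therefore return N times the original sequence on a round trip.

```python
import cmath

def dft(x):
    """Direct O(N^2) evaluation of Equation (1)."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / N) for n in range(N))
            for k in range(N)]

def idft(A, scale=True):
    """Equation (2). With scale=True the result is divided by N (an assumption
    added here) so that idft(dft(x)) reproduces x."""
    N = len(A)
    out = [sum(A[k] * cmath.exp(2j * cmath.pi * n * k / N) for k in range(N))
           for n in range(N)]
    return [v / N for v in out] if scale else out

x = [1, 2 - 1j, 0, -1 + 2j]
roundtrip = idft(dft(x))
max_error = max(abs(a - b) for a, b in zip(x, roundtrip))
```

Each output value costs N complex multiplications, which is the N-squared behavior that the algorithms in this report are designed to avoid.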

In 1965, Cooley and Tukey began a revolution in the field of signal
processing when they introduced the Fast Fourier Transform (FFT) as an
algorithm for efficiently computing the DFT [1]. The FFT reduced the number of
computations required to compute the DFT from a number proportional to $N^2$
to one proportional to $N \log_2 N$. This reduction of computations spurred
widespread application of the DFT to many problems in diverse fields. In
addition to spectral analysis of time series, the FFT has been used for fast
correlation of sequences, fast convolution of sequences for the purpose of
digital filtering, and radar Digital Beam Forming (DBF). In DBF applications,
the output of each element of a receive-only array antenna is independently
converted into complex baseband samples. A DFT is then used to transform the
data into a simultaneous set of receive beams uniformly distributed in
space [2].

The ever increasing importance of the DFT has led to the development of many
new efficient algorithms requiring far fewer computations than the FFT. This
report examines the multidimensional DFT decomposition theory central to many
of these algorithms, and gives a brief introduction to the radix-2 FFT,
radix-4 FFT, mixed radix fast Fourier transform (MFFT), prime factor algorithm
(PFA), Winograd Fourier transform algorithm (WFTA), and SWIFT algorithms. In
addition, the arithmetic complexity of these algorithms is compared for
various one- and two-dimensional transform sizes. Included in the comparison
are the number of real additions, real multiplications, total real operations,
total equivalent real multiplications, and integrated circuit chips required
for each algorithm.

II.  MULTIDIMENSIONAL  DFT  THEORY 

All of the efficient DFT algorithms examined in this report are based on
Good's standard multidimensional DFT decomposition technique [3-4]. This
technique decomposes a large one-dimensional DFT into a sequence of smaller
DFTs which are combined with twiddle factors (i.e., complex weights or
multiplications). The number of multiplications and additions required to
compute a DFT is greatly reduced by computing its decomposed small point DFT
transforms, even though the twiddle factors increase the computational load.
However, the multidimensional decomposition is only applicable to DFTs of
length N, where N is factorable into integer values (i.e.,
$N = N_1 N_2 \cdots N_p$). In order to circumvent this requirement, a data
sequence can be appended with zeros to give a length that is factorable.

The basic mechanism of the multidimensional decomposition is transforming
the one-dimensional data sequence of length $N = N_1 N_2$ into a
two-dimensional rectangular array of $N_1$ rows and $N_2$ columns. The N-point
DFT can then be computed by performing $N_2$-point DFTs on all the rows,
performing $N_1$-point DFTs on all the columns, and, in some cases,
multiplying the intermediate results by complex twiddle factors. If desired,
the $N_1$- and $N_2$-point DFTs can be decomposed if they are factorable. This
process can be applied repeatedly to the one-dimensional DFTs until the
original N-point DFT has been completely decomposed into all of its integer
factors.

A unique, or one-to-one, mapping function is needed to map the
one-dimensional arrays A(k) and x(n) of the DFT expression

$$A(k) = \sum_{n} x(n) W_N^{nk} \tag{3}$$

into the two-dimensional arrays $A(k_1,k_2)$ and $\hat{x}(n_1,n_2)$ of the
two-dimensional function [5]

$$A(k_1,k_2) = \sum_{n_1} \sum_{n_2} \hat{x}(n_1,n_2) W_N^{nk} \tag{4}$$

where $k_1, n_1 = 0, 1, \ldots, N_1 - 1$; $k_2, n_2 = 0, 1, \ldots, N_2 - 1$;
and $W_N^{nk} = \exp(-j2\pi nk/N)$. Although many different mapping functions
exist, the mapping function fundamental to most fast DFT algorithms is


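$$\begin{aligned} n &= (L_1 n_1 + L_2 n_2) \bmod N \\ k &= (L_3 k_1 + L_4 k_2) \bmod N \end{aligned} \tag{5}$$

A simple mapping of this form is

$$\begin{aligned} n &= (n_1 + N_1 n_2) \bmod N \\ k &= (N_2 k_1 + k_2) \bmod N . \end{aligned} \tag{6}$$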


For example, this mapping can be used to decompose the vectors A(k) and
x(n) of an eight-point DFT into two-dimensional functions with $N_1$ rows and
$N_2$ columns. For the values $N_1 = 2$ and $N_2 = 4$, the mapping between
x(n) and $\hat{x}(n_1,n_2)$ is

$$\hat{x}(n_1,n_2) = x\big((n_1 + 2n_2) \bmod 8\big), \tag{7}$$

as shown below:

$$\begin{array}{llll}
\hat{x}(0,0) = x(0) & \hat{x}(0,1) = x(2) & \hat{x}(0,2) = x(4) & \hat{x}(0,3) = x(6) \\
\hat{x}(1,0) = x(1) & \hat{x}(1,1) = x(3) & \hat{x}(1,2) = x(5) & \hat{x}(1,3) = x(7)
\end{array} \tag{8}$$
Note that each position in the above 2x4 matrix is assigned a unique value
from the x(n) vector. The mapping for the output values is

$$A(k_1,k_2) = A\big((4k_1 + k_2) \bmod 8\big) \tag{9}$$

as shown below:

$$\begin{array}{llll}
A(0,0) = A(0) & A(0,1) = A(1) & A(0,2) = A(2) & A(0,3) = A(3) \\
A(1,0) = A(4) & A(1,1) = A(5) & A(1,2) = A(6) & A(1,3) = A(7) .
\end{array} \tag{10}$$
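
The eight-point example can be reproduced in a few lines. This is an illustrative sketch only; the names `n_map`, `k_map`, and `is_one_to_one` are hypothetical and do not appear in the report.

```python
N1, N2 = 2, 4
N = N1 * N2

# Input map of Equation (7): n = (n1 + N1*n2) mod N
n_map = [[(n1 + N1 * n2) % N for n2 in range(N2)] for n1 in range(N1)]

# Output map of Equation (9): k = (N2*k1 + k2) mod N
k_map = [[(N2 * k1 + k2) % N for k2 in range(N2)] for k1 in range(N1)]

def is_one_to_one(table, length):
    """True when every index 0..length-1 appears exactly once in the table."""
    flat = [value for row in table for value in row]
    return sorted(flat) == list(range(length))
```

Row `n1` of `n_map` lists the one-dimensional input indices placed in row `n1` of the 2x4 array, and `k_map` does the same for the outputs.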

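The mapping of (5) can be substituted into Equation (4) giving

$$A(k_1,k_2) = \sum_{n_1} \sum_{n_2} \hat{x}(n_1,n_2) W_N^{(L_1 n_1 + L_2 n_2)(L_3 k_1 + L_4 k_2)}$$

or

$$A(k_1,k_2) = \sum_{n_1} \sum_{n_2} \hat{x}(n_1,n_2) W_N^{L_2 L_4 n_2 k_2} W_N^{L_1 L_4 n_1 k_2} W_N^{L_1 L_3 n_1 k_1} W_N^{L_2 L_3 n_2 k_1} , \tag{11}$$

where

$$A(k_1,k_2) = A\big((L_3 k_1 + L_4 k_2) \bmod N\big)$$

$$\hat{x}(n_1,n_2) = x\big((L_1 n_1 + L_2 n_2) \bmod N\big) .$$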

$L_1$, $L_2$, $L_3$, and $L_4$ can be selected using the results of a theorem
from number theory to ensure a unique mapping. Case A of the theorem applies
when the factors $N_1$ and $N_2$ are mutually prime, that is, 1 is the largest
common integer factor. Case B applies when $N_1$ and $N_2$ are not mutually
prime, that is, $N_1$ and $N_2$ have a common integer factor, $\lambda$, which
is greater than 1. The notation used in the theorem to represent these two
cases is

CASE A:  $(N_1, N_2) = 1$

CASE B:  $(N_1, N_2) = \lambda$ ,  (12)

where the operator $(N_1, N_2)$ is defined as the greatest common integer
factor of $N_1$ and $N_2$. The theorem can be written in terms of n or k of
Equation (5) as they are of the same form. For simplicity, however, the
theorem will be expressed for both the n and k mapping.

Theorem: The necessary and sufficient conditions for the mapping of
Expression (5) to be unique are:

CASE A:

1) $L_1 = \alpha N_2$ and $L_2 \ne \beta N_1$ and $(\alpha, N_1) = (L_2, N_2) = 1$
   $L_3 = \gamma N_2$ and $L_4 \ne \delta N_1$ and $(\gamma, N_1) = (L_4, N_2) = 1$   (13)

2) $L_1 \ne \alpha N_2$ and $L_2 = \beta N_1$ and $(L_1, N_1) = (\beta, N_2) = 1$
   $L_3 \ne \gamma N_2$ and $L_4 = \delta N_1$ and $(L_3, N_1) = (\delta, N_2) = 1$   (14)

3) $L_1 = \alpha N_2$ and $L_2 = \beta N_1$ and $(\alpha, N_1) = (\beta, N_2) = 1$
   $L_3 = \gamma N_2$ and $L_4 = \delta N_1$ and $(\gamma, N_1) = (\delta, N_2) = 1$   (15)

CASE B:

1) $L_1 = \alpha N_2$ and $L_2 \ne \beta N_1$ and $(\alpha, N_1) = (L_2, N_2) = 1$
   $L_3 = \gamma N_2$ and $L_4 \ne \delta N_1$ and $(\gamma, N_1) = (L_4, N_2) = 1$   (16)

2) $L_1 \ne \alpha N_2$ and $L_2 = \beta N_1$ and $(L_1, N_1) = (\beta, N_2) = 1$
   $L_3 \ne \gamma N_2$ and $L_4 = \delta N_1$ and $(L_3, N_1) = (\delta, N_2) = 1$   (17)

The variables $L_1$, $L_2$, $\alpha$, and $\beta$ of this theorem will be used
for the mapping of n and the variables $L_3$, $L_4$, $\gamma$, and $\delta$
will be used for the mapping of k. All of these variables are non-zero
positive integers.
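
The theorem can be spot-checked numerically for small factor pairs. The sketch below is an illustration under stated assumptions, not part of the report: `is_unique_map` tests one-to-one behavior by exhaustive enumeration, and `theorem_allows` encodes the conditions for a single coefficient pair as they are read here, taking "L1 = alpha*N2" to mean that L1 is a multiple of N2 and "L2 ≠ beta*N1" to mean that L2 is not a multiple of N1. Both function names are hypothetical.

```python
from math import gcd

def is_unique_map(L1, L2, N1, N2):
    """Brute force: does n = (L1*n1 + L2*n2) mod N reach every index once?"""
    N = N1 * N2
    images = {(L1 * n1 + L2 * n2) % N for n1 in range(N1) for n2 in range(N2)}
    return len(images) == N

def theorem_allows(L1, L2, N1, N2):
    """Conditions of the theorem for one coefficient pair (L1, L2)."""
    first_is_multiple = L1 % N2 == 0     # L1 = alpha * N2
    second_is_multiple = L2 % N1 == 0    # L2 = beta * N1
    if first_is_multiple and second_is_multiple:
        # Sub-case 3: only available when N1 and N2 are mutually prime (CASE A)
        return (gcd(N1, N2) == 1
                and gcd(L1 // N2, N1) == 1
                and gcd(L2 // N1, N2) == 1)
    if first_is_multiple:
        # Sub-case 1: (alpha, N1) = (L2, N2) = 1
        return gcd(L1 // N2, N1) == 1 and gcd(L2, N2) == 1
    if second_is_multiple:
        # Sub-case 2: (L1, N1) = (beta, N2) = 1
        return gcd(L1, N1) == 1 and gcd(L2 // N1, N2) == 1
    return False
```

For example, `theorem_allows(1, 2, 2, 4)` accepts the input map of the eight-point example, while `theorem_allows(1, 1, 2, 4)` rejects a map in which neither coefficient is a multiple of the other factor.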

CASE B of the theorem will be considered first as it is the basis of the
Decimation-In-Time (DIT) and Decimation-In-Frequency (DIF) algorithms. These
algorithms are used to implement the familiar radix-2 and radix-4 FFT
algorithms.

The DIT algorithm is derived by using Equation (17) for the mapping of n
and Equation (16) for the mapping of k. Combining these expressions with that
of (5) gives the mapping

$$\begin{aligned} n &= (L_1 n_1 + \beta N_1 n_2) \bmod N \\ k &= (\gamma N_2 k_1 + L_4 k_2) \bmod N \end{aligned} \tag{18}$$

where $L_1 \ne \alpha N_2$ and $L_4 \ne \delta N_1$.

Substituting this into Equation (11) gives

$$A(k_1,k_2) = \sum_{n_1} \sum_{n_2} \hat{x}(n_1,n_2) W_{N_2}^{\beta L_4 n_2 k_2} W_N^{L_1 L_4 n_1 k_2} W_{N_1}^{L_1 \gamma n_1 k_1} . \tag{19}$$

Note that the last term of Equation (11) is eliminated as

$$W_N^{\beta\gamma n_2 k_1 (N_1 N_2)} = \exp(-j2\pi\beta\gamma n_2 k_1 N/N) = \big(\exp(-j2\pi)\big)^{\beta\gamma n_2 k_1} = 1 . \tag{20}$$


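Choosing the values $L_1 = L_4 = \beta = \gamma = 1$ satisfies the theorem
and, when substituted into Equation (19), gives

$$A(k_1,k_2) = \sum_{n_1} \sum_{n_2} \hat{x}(n_1,n_2) W_{N_2}^{n_2 k_2} W_N^{n_1 k_2} W_{N_1}^{n_1 k_1} , \tag{21}$$

where

$$A(k_1,k_2) = A\big((N_2 k_1 + k_2) \bmod N\big)$$

$$\hat{x}(n_1,n_2) = x\big((n_1 + N_1 n_2) \bmod N\big) .$$

The $W_N^{n_1 k_2}$ term is the twiddle factor.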

A brute force computation of Equation (21) would require N complex
multiplications and N-1 complex additions for each value of the $A(k_1,k_2)$
array, assuming prior combination of the three complex exponential terms.
This would require $N^2$ complex multiplications and N(N-1) complex additions
to compute the DFT. Fortunately, the number of operations required can be
reduced by using one-dimensional DFTs on the rows and columns as suggested by
the following nesting of Equation (21)

$$A(k_1,k_2) = \sum_{n_1=0}^{N_1-1} W_{N_1}^{n_1 k_1} \left[ W_N^{n_1 k_2} \left[ \sum_{n_2=0}^{N_2-1} \hat{x}(n_1,n_2) W_{N_2}^{n_2 k_2} \right] \right] . \tag{22}$$

The innermost bracket is a function of $n_1$ and $k_2$

$$q(n_1,k_2) = \sum_{n_2} \hat{x}(n_1,n_2) W_{N_2}^{n_2 k_2} \tag{23}$$

where $k_2 = 0, 1, \ldots, N_2-1$ and $n_1$ is fixed by the value of the
outermost summation symbol. This is an $N_2$-point one-dimensional DFT on the
$n_1$th row of data. As indicated by the next level of brackets, each of the N
$q(n_1,k_2)$ values is multiplied by its complex twiddle factor. The result of
the two innermost brackets is still a function of $n_1$ and $k_2$

$$h(n_1,k_2) = q(n_1,k_2) W_N^{n_1 k_2} . \tag{24}$$

Combining Equations (22), (23), and (24) gives

$$A(k_1,k_2) = \sum_{n_1=0}^{N_1-1} h(n_1,k_2) W_{N_1}^{n_1 k_1} , \tag{25}$$

where $k_1 = 0, 1, \ldots, N_1-1$ and $k_2$ is a fixed value for each column
of data. This is an $N_1$-point one-dimensional DFT on the $k_2$th column.
Thus, using the nesting of Equation (22), the N-point DFT is calculated by:
(1) calculating an $N_2$-point one-dimensional DFT on the data of each of the
$N_1$ rows; (2) multiplying each intermediate transformed data point by a
complex twiddle factor; and (3) performing an $N_1$-point one-dimensional DFT
on the twiddled data of each of the $N_2$ columns. The required real
multiplications and additions for this process can be expressed

$$\mathrm{NRMULT} = N_1 \mu_2 + N_2 \mu_1 + 4N \tag{26}$$

$$\mathrm{NRADDS} = N_1 \alpha_2 + N_2 \alpha_1 + 2N , \tag{27}$$

where

$\mu_i$ = number of real multiplications in the $N_i$-point DFT

$\alpha_i$ = number of real additions in the $N_i$-point DFT.

The 4N and 2N terms account for the N complex twiddle multiplications, each of
which costs four real multiplications and two real additions.

This method is generally more efficient than the brute force computation of
Equation (21). Greater efficiency results if the $N_1$ and/or $N_2$
one-dimensional DFT(s) of the above process can be decomposed into still
smaller factors.
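
The three-step DIT procedure of Equation (22) can be sketched as follows. This is a minimal illustration rather than the report's implementation: the small transforms are computed with a direct DFT instead of optimized short-length modules, and the function names `dft` and `dit_dft` are hypothetical.

```python
import cmath

def dft(x):
    """Direct DFT, used here for the small row and column transforms."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / N) for n in range(N))
            for k in range(N)]

def dit_dft(x, N1, N2):
    """N-point DFT (N = N1*N2) following the nesting of Equation (22)."""
    N = N1 * N2
    # Input map: xhat(n1, n2) = x((n1 + N1*n2) mod N)
    rows = [[x[(n1 + N1 * n2) % N] for n2 in range(N2)] for n1 in range(N1)]
    # Step 1: N2-point DFT of each of the N1 rows, giving q(n1, k2)
    q = [dft(row) for row in rows]
    # Step 2: multiply by the twiddle factors W_N^(n1*k2), giving h(n1, k2)
    h = [[q[n1][k2] * cmath.exp(-2j * cmath.pi * n1 * k2 / N)
          for k2 in range(N2)] for n1 in range(N1)]
    # Step 3: N1-point DFT of each of the N2 columns
    A = [0j] * N
    for k2 in range(N2):
        column = dft([h[n1][k2] for n1 in range(N1)])
        for k1 in range(N1):
            A[(N2 * k1 + k2) % N] = column[k1]   # output map k = N2*k1 + k2
    return A
```

Calling `dit_dft(x, 2, 4)` on an eight-point sequence should agree with `dft(x)` to rounding error, while using N1 transforms of length N2, N twiddle multiplications, and N2 transforms of length N1.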

The DIT Equation (21) can also be nested as

$$A(k_1,k_2) = \sum_{n_1} W_{N_1}^{n_1 k_1} \left[ \sum_{n_2} W_{N_2}^{n_2 k_2} \left[ \hat{x}(n_1,n_2) W_N^{n_1 k_2} \right] \right] . \tag{28}$$

The computation suggested by this nesting is very similar to that of Equation
(22) as only the first two computation steps are reversed. For this nesting
the N-point DFT is calculated by: (1) multiplying each data point by the
appropriate complex twiddle factor; (2) calculating an $N_2$-point DFT of each
row of the intermediate data; and (3) performing an $N_1$-point DFT of each
column of the data calculated in step 2. The arithmetic requirements for
computing Equation (28) are the same as for Equation (22).

A final way the DIT Equation (21) can be nested is

$$A(k_1,k_2) = \sum_{n_2} W_{N_2}^{n_2 k_2} \left[ \sum_{n_1} W_{N_1}^{n_1 k_1} \left[ \hat{x}(n_1,n_2) W_N^{n_1 k_2} \right] \right] . \tag{29}$$

For this reverse nesting, the N-point DFT is calculated by: (1) multiplying
each data point by the appropriate twiddle factor; (2) calculating an
$N_1$-point DFT of each column of the twiddled data; and (3) calculating an
$N_2$-point DFT of each row. This also has the same arithmetic requirements as
the other DIT nestings of Equations (22) and (28).

The DIF algorithm is obtained by using Equation (16) for the mapping of
n and Equation (17) for the mapping of k. Combining these expressions with
that of (5) gives the mapping

$$\begin{aligned} n &= (\alpha N_2 n_1 + L_2 n_2) \bmod N \\ k &= (L_3 k_1 + \delta N_1 k_2) \bmod N \end{aligned} \tag{30}$$

where $L_2 \ne \beta N_1$ and $L_3 \ne \gamma N_2$.

Substituting this into Equation (11) gives

$$A(k_1,k_2) = \sum_{n_1} \sum_{n_2} \hat{x}(n_1,n_2) W_{N_2}^{\delta L_2 n_2 k_2} W_{N_1}^{\alpha L_3 n_1 k_1} W_N^{L_2 L_3 n_2 k_1} . \tag{31}$$

Note that this combination of CASE B eliminates the second term of Equation
(11) as

$$W_N^{\alpha\delta n_1 k_2 (N_2 N_1)} = \exp(-j2\pi\alpha\delta n_1 k_2 N/N) = 1 . \tag{32}$$
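Choosing the values $L_2 = L_3 = \alpha = \delta = 1$ satisfies the theorem
and, when substituted into Equation (31), gives

$$A(k_1,k_2) = \sum_{n_1} \sum_{n_2} \hat{x}(n_1,n_2) W_{N_2}^{n_2 k_2} W_{N_1}^{n_1 k_1} W_N^{n_2 k_1} , \tag{33}$$

where

$$A(k_1,k_2) = A\big((k_1 + N_1 k_2) \bmod N\big)$$

$$\hat{x}(n_1,n_2) = x\big((N_2 n_1 + n_2) \bmod N\big) .$$

The $W_N^{n_2 k_1}$ term is the twiddle factor. Like the DIT algorithm, the DIF
algorithm requires on the order of $N^2$ complex operations until nested
according to one of the following three expressions:

$$A(k_1,k_2) = \sum_{n_1=0}^{N_1-1} W_{N_1}^{n_1 k_1} \left[ \sum_{n_2=0}^{N_2-1} W_{N_2}^{n_2 k_2} \left[ \hat{x}(n_1,n_2) W_N^{n_2 k_1} \right] \right] \tag{34}$$

$$A(k_1,k_2) = \sum_{n_2=0}^{N_2-1} W_{N_2}^{n_2 k_2} \left[ W_N^{n_2 k_1} \left[ \sum_{n_1=0}^{N_1-1} \hat{x}(n_1,n_2) W_{N_1}^{n_1 k_1} \right] \right] \tag{35}$$

$$A(k_1,k_2) = \sum_{n_2=0}^{N_2-1} W_{N_2}^{n_2 k_2} \left[ \sum_{n_1=0}^{N_1-1} W_{N_1}^{n_1 k_1} \left[ \hat{x}(n_1,n_2) W_N^{n_2 k_1} \right] \right] . \tag{36}$$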


Equations (34) and (36) are calculated like Equations (28) and (29),
respectively. Only the twiddle factors and the mapping are different.

Equation (35) is calculated by (1) calculating an $N_1$-point DFT of each
column, (2) twiddling the results, and (3) calculating an $N_2$-point DFT on
each row of the twiddled results. All three nested expressions of the DIF
algorithm require the same amount of computation as the nested DIT algorithms.
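
The column-first DIF procedure of Equation (35) can be sketched in the same way. The code below is illustrative only and assumes the simple DIF mapping n = N2*n1 + n2, k = k1 + N1*k2 with twiddle factor W_N^(n2*k1); the function names `dft` and `dif_dft` are hypothetical, and a direct DFT stands in for the small transforms.

```python
import cmath

def dft(x):
    """Direct DFT, used here for the small row and column transforms."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / N) for n in range(N))
            for k in range(N)]

def dif_dft(x, N1, N2):
    """N-point DFT (N = N1*N2) following the nesting of Equation (35)."""
    N = N1 * N2
    # Input map: xhat(n1, n2) = x((N2*n1 + n2) mod N)
    xhat = [[x[(N2 * n1 + n2) % N] for n2 in range(N2)] for n1 in range(N1)]
    g = [[0j] * N2 for _ in range(N1)]
    for n2 in range(N2):
        # Step 1: N1-point DFT of column n2
        column = dft([xhat[n1][n2] for n1 in range(N1)])
        for k1 in range(N1):
            # Step 2: twiddle by W_N^(n2*k1)
            g[k1][n2] = column[k1] * cmath.exp(-2j * cmath.pi * n2 * k1 / N)
    A = [0j] * N
    for k1 in range(N1):
        # Step 3: N2-point DFT of row k1 of the twiddled data
        row = dft(g[k1])
        for k2 in range(N2):
            A[(k1 + N1 * k2) % N] = row[k2]   # output map k = k1 + N1*k2
    return A
```

The operation count is the same as for the DIT sketch of Equation (22); only the order of the steps and the index maps differ.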

The two other possible combinations of CASE B of the theorem involve using
Equation (16) or Equation (17) for both the n and k mapping. However, neither
of these maps allows the elimination of a complex exponential term of Equation
(11). This prevents the efficient nesting of the two-dimensional function of
Equation (11).

CASE A of the theorem is the basis of all DFT algorithms involving mutually
prime factors, including the WFTA, PFA, and the SWIFT algorithms. Using
Expression (15) of the theorem for n and k gives the mapping

$$\begin{aligned} n &= (\alpha N_2 n_1 + \beta N_1 n_2) \bmod N \\ k &= (\gamma N_2 k_1 + \delta N_1 k_2) \bmod N . \end{aligned} \tag{37}$$

Substituting this into Equation (11) gives

$$A(k_1,k_2) = \sum_{n_1} \sum_{n_2} \hat{x}(n_1,n_2) W_{N_2}^{\beta\delta N_1 n_2 k_2} W_{N_1}^{\alpha\gamma N_2 n_1 k_1} . \tag{38}$$


Note that both the second and fourth complex exponential terms of Equation
(11) are eliminated by this mapping. Good [3] suggested using the values

$$\begin{aligned} \alpha &= \beta = 1 \\ \delta &= N_1^{-1} \bmod N_2 \\ \gamma &= N_2^{-1} \bmod N_1 \end{aligned} \tag{39}$$

where $\delta$ and $\gamma$ are multiplicative inverses of $N_1$ and $N_2$,
respectively. The multiplicative inverse of a number N modulo M is defined as
the unique integer, A, which belongs to the set (0, 1, ..., M-1) and satisfies

$$(A \cdot N) \bmod M = 1 . \tag{40}$$

For example, if $N_1 = 3$ and $N_2 = 7$ are used in Equations (37) and (39)
then $\delta = 5$ and $\gamma = 1$, giving the mapping

$$\begin{aligned} n &= (7n_1 + 3n_2) \bmod 21 \\ k &= (7k_1 + 15k_2) \bmod 21 . \end{aligned} \tag{41}$$
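
The coefficients of this example can be checked with modular inverses. The sketch below is illustrative: the function name `good_maps` is hypothetical, and it relies on Python's three-argument `pow` with a negative exponent (available from Python 3.8) to compute the multiplicative inverse defined in Equation (40).

```python
def good_maps(N1, N2):
    """Coefficients of the mapping in Equations (37) and (39) for mutually
    prime N1 and N2. Returns (delta, gamma, n_coeffs, k_coeffs)."""
    N = N1 * N2
    delta = pow(N1, -1, N2)   # multiplicative inverse of N1 modulo N2
    gamma = pow(N2, -1, N1)   # multiplicative inverse of N2 modulo N1
    n_coeffs = (N2, N1)                             # n = (N2*n1 + N1*n2) mod N
    k_coeffs = ((gamma * N2) % N, (delta * N1) % N)  # k = (c1*k1 + c2*k2) mod N
    return delta, gamma, n_coeffs, k_coeffs

delta, gamma, n_coeffs, k_coeffs = good_maps(3, 7)
```

With these inputs the call yields delta = 5, gamma = 1, input coefficients (7, 3), and output coefficients (7, 15), in agreement with Equation (41).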

The multiplicative inverses of Equation (39) are guaranteed to exist because
$N_1$ and $N_2$ have been restricted to being mutually prime for CASE A.
Substituting the values of Equation (39) into Equation (38) gives

$$A(k_1,k_2) = \sum_{n_1} \sum_{n_2} \hat{x}(n_1,n_2) W_{N_2}^{n_2 k_2} W_{N_1}^{n_1 k_1} , \tag{42}$$

where

$$A(k_1,k_2) = A\big(((N_2^{-1} \bmod N_1) N_2 k_1 + (N_1^{-1} \bmod N_2) N_1 k_2) \bmod N\big)$$

$$\hat{x}(n_1,n_2) = x\big((N_2 n_1 + N_1 n_2) \bmod N\big) .$$

Because there is no twiddle factor, Equation (42) can be computed like a
two-dimensional DFT. Thus, a DFT of length $N = N_1 N_2$ where
$(N_1, N_2) = 1$ can be computed according to the two obvious nesting
arrangements

$$A(k_1,k_2) = \sum_{n_1=0}^{N_1-1} W_{N_1}^{n_1 k_1} \left[ \sum_{n_2=0}^{N_2-1} \hat{x}(n_1,n_2) W_{N_2}^{n_2 k_2} \right] , \tag{43}$$

$$A(k_1,k_2) = \sum_{n_2=0}^{N_2-1} W_{N_2}^{n_2 k_2} \left[ \sum_{n_1=0}^{N_1-1} \hat{x}(n_1,n_2) W_{N_1}^{n_1 k_1} \right] . \tag{44}$$

The N-point DFT as nested in Equation (43) can be calculated by performing
$N_2$-point DFTs on all $N_1$ rows of the data and then performing
$N_1$-point DFTs on all $N_2$ columns of the intermediate data. The nesting of
Equation (44) simply dictates calculating the column DFTs before calculating
the row DFTs. The above method is referred to as the row/column technique.
For the above two cases the computational requirement for calculating an
$N = N_1 N_2$ point DFT where $(N_1, N_2) = 1$ is

$$\mathrm{NRMULT} = N_1 \mu_2 + N_2 \mu_1 \tag{45}$$

$$\mathrm{NRADDS} = N_1 \alpha_2 + N_2 \alpha_1 . \tag{46}$$
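
The twiddle-free row/column evaluation of Equation (43) can be sketched as follows. This is an illustration under stated assumptions, not the report's code: direct DFTs stand in for the optimized small-point modules, the function names `dft` and `prime_factor_dft` are hypothetical, and the modular inverses use Python's `pow(a, -1, m)` (Python 3.8 or later).

```python
import cmath

def dft(x):
    """Direct DFT, used here for the small row and column transforms."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / N) for n in range(N))
            for k in range(N)]

def prime_factor_dft(x, N1, N2):
    """N-point DFT for mutually prime N1, N2 using the maps of Equation (42)."""
    N = N1 * N2
    delta = pow(N1, -1, N2)   # N1^-1 mod N2
    gamma = pow(N2, -1, N1)   # N2^-1 mod N1
    # Input map: xhat(n1, n2) = x((N2*n1 + N1*n2) mod N)
    xhat = [[x[(N2 * n1 + N1 * n2) % N] for n2 in range(N2)] for n1 in range(N1)]
    # N2-point DFTs on all N1 rows
    rows = [dft(row) for row in xhat]
    A = [0j] * N
    for k2 in range(N2):
        # N1-point DFTs on all N2 columns; no twiddle factors are needed
        column = dft([rows[n1][k2] for n1 in range(N1)])
        for k1 in range(N1):
            A[(gamma * N2 * k1 + delta * N1 * k2) % N] = column[k1]
    return A
```

Compared with the DIT and DIF sketches, the only extra work is in the index maps; the 4N real multiplications and 2N real additions for twiddle factors in Equations (26) and (27) disappear.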

If $N_2$ can be factored further such that

$$N_2 = N_3 N_4 , \tag{47}$$

where $N_3$ and $N_4$ are mutually prime, then the arithmetic requirements for
computing the $N_2$-point DFT are

$$\mu_2 = N_3 \mu_4 + N_4 \mu_3 \tag{48}$$

$$\alpha_2 = N_3 \alpha_4 + N_4 \alpha_3 . \tag{49}$$

Thus, if N is factored such that

$$N = N_1 N_3 N_4 , \tag{50}$$

where all the factors are mutually prime, Equations (48) and (49) can be
substituted into Equations (45) and (46) to give, with the three factors
relabeled $N_1$, $N_2$, and $N_3$, the requirements

$$\mathrm{NRMULT} = N_1 N_2 \mu_3 + N_3 N_1 \mu_2 + N_2 N_3 \mu_1 \tag{51}$$

$$\mathrm{NRADDS} = N_1 N_2 \alpha_3 + N_3 N_1 \alpha_2 + N_2 N_3 \alpha_1 . \tag{52}$$

In general, when N is factored into r mutually prime factors

$$N = N_1 N_2 \cdots N_r , \tag{53}$$

the arithmetic requirements are simply

$$\mathrm{NRMULT} = N \sum_{i=1}^{r} \frac{\mu_i}{N_i} \tag{54}$$

$$\mathrm{NRADDS} = N \sum_{i=1}^{r} \frac{\alpha_i}{N_i} . \tag{55}$$

III.  EFFICIENT  DFT  ALGORITHMS 

The radix-2 FFT is restricted to lengths N where N is a power of 2
(i.e., $N = 2^r$) [6]. The radix-2 algorithm is based on a complete
decomposition of the N-point DFT into r stages of 2-point DFTs. For N = 2 the
DFT definition (see Equation (1)) simplifies to

$$A(0) = x(0) + x(1) \tag{56}$$

$$A(1) = x(0) - x(1) . \tag{57}$$

Thus, only two complex additions are required for each 2-point DFT. As shown
in the last section, however, twiddle factors or complex multiplications are
required between each 2-point DFT as the factors of N are not mutually prime.
The number of real multiplications and additions required for an N-point
radix-2 FFT can be expressed as

$$\mathrm{NRMULT} = 2N \log_2 N \tag{58}$$

$$\mathrm{NRADDS} = 3N \log_2 N . \tag{59}$$


The radix-4 FFT is restricted to lengths N where N is a power of 4
(i.e., $N = 4^r$) [6]. The radix-4 algorithm only partially decomposes N into
r 4-point DFTs. The 4-point DFT also requires no multiplications, as shown in
Appendix B. Like the radix-2 FFT, the radix-4 FFT requires complex
multiplications because of twiddle factors. However, the radix-4 FFT requires
25% fewer multiplications than the radix-2 FFT as the former has fewer small
point DFTs to connect with twiddle factors. An N-point radix-4 FFT requires

$$\mathrm{NRMULT} = (3N/2) \log_2 N \tag{60}$$

$$\mathrm{NRADDS} = (11N/4) \log_2 N . \tag{61}$$
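
The radix-2 and radix-4 operation counts are easy to tabulate. The helper names below (`radix2_counts`, `radix4_counts`) are hypothetical; the formulas are those given above for the two algorithms, evaluated in integer arithmetic.

```python
def _log2_exact(N):
    """Return log2(N) for N an exact power of two, else raise ValueError."""
    if N < 2 or N & (N - 1):
        raise ValueError("N must be a power of 2")
    return N.bit_length() - 1

def radix2_counts(N):
    """(real multiplications, real additions) = (2N log2 N, 3N log2 N)."""
    lg = _log2_exact(N)
    return 2 * N * lg, 3 * N * lg

def radix4_counts(N):
    """(real multiplications, real additions) = ((3N/2) log2 N, (11N/4) log2 N)."""
    lg = _log2_exact(N)
    if lg % 2:
        raise ValueError("N must be a power of 4")
    return 3 * N * lg // 2, 11 * N * lg // 4

m2, a2 = radix2_counts(4096)
m4, a4 = radix4_counts(4096)
mult_ratio = m4 / m2                  # 0.75
add_ratio = a4 / a2                   # about 0.917
total_ratio = (m4 + a4) / (m2 + a2)   # 0.85
```

For N = 4096 the ratios reproduce the 25% saving in multiplications quoted above.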

The MFFT was published by Singleton in 1969 [7]. The MFFT can compute the
DFT of any sequence length. N must be factored as

$$N = 2^r 3^s 4^t 5^u p_1^{m_1} p_2^{m_2} \cdots p_k^{m_k} , \tag{62}$$

where the $p_i$'s represent odd prime numbers. The arithmetic requirements of
the MFFT were determined [8] to be

$$\mathrm{NRMULT} = 2rN + 4sN + 3tN + 32uN/5 + \sum_{i=1}^{k} \left[ 2(p_i - 1) + m_i N (p_i - 1)^2 / p_i + 4 m_i N (p_i - 1)/p_i \right] - 4(N-1) \tag{63}$$

$$\mathrm{NRADDS} = 3rN + 16sN/3 + 11tN/2 + 8uN + \sum_{i=1}^{k} \left[ (p_i - 1) + 7N m_i (p_i - 1)/p_i + m_i N (p_i - 1)^2 / p_i \right] - 2(N-1) . \tag{64}$$

For the comparison purposes of this report, the arithmetic requirements of the
MFFT were only calculated for the lengths suitable for the other efficient
algorithms. The arithmetic requirements based on the restricted factorization

$$N = 2^r \cdot 3^s \cdot 4^t \cdot 5^u \cdot 7^w , \tag{65}$$

can be expressed

$$\mathrm{NRMULT} = N(2r + 4s + 3t + 32u/5 + 60w/7 - 4) + 12w + 4 \tag{66}$$

$$\mathrm{NRADDS} = N(3r + 16s/3 + 11t/2 + 8u + 78w/7 - 2) + 6w + 2 . \tag{67}$$

The SWIFT algorithm is based on the standard multidimensional DFT
decomposition which results when all the factors of N are mutually prime [9].
As shown by Equations (43) and (44) of the last section, no twiddle factors
are required for this algorithm. Thus, as discussed in the last section, the
arithmetic requirements of an N-point SWIFT algorithm with r mutually prime
factors are

$$\mathrm{NRMULT} = N \sum_{i=1}^{r} \frac{\mu_i}{N_i} \tag{68}$$

$$\mathrm{NRADDS} = N \sum_{i=1}^{r} \frac{\alpha_i}{N_i} . \tag{69}$$


The SWIFT algorithm uses efficient small point DFT algorithms of lengths
2, 3, 4, 5, 6, 7, 8, 9, and 16. Table 3-1 gives the number of non-trivial
multiplications and additions required for each of these small point DFTs.

TABLE 3-1.  SWIFT SHORT DFT REAL OPERATIONS REQUIREMENTS

     N      $\mu_i$    $\alpha_i$
     2         0           4
     3         4          12
     4         0          16
     5        16          32
     7        36          60
     8         4          52
     9        44          88
    16        24         144

A listing of the algorithms is given in Appendix C. The different mutually
prime combinations of these small point DFTs allow the SWIFT algorithm to
compute DFTs of lengths N = 2 to N = 5040.

The PFA [10-11] is also based on the standard multidimensional DFT
decomposition which results when the factors of N are mutually prime.
Accordingly, the arithmetic requirements of an N-point PFA algorithm with r
mutually prime factors are

$$\mathrm{NRMULT} = N \sum_{i=1}^{r} \frac{\mu_i}{N_i} \tag{70}$$

$$\mathrm{NRADDS} = N \sum_{i=1}^{r} \frac{\alpha_i}{N_i} . \tag{71}$$


The PFA also uses efficient small point DFTs of lengths 2, 3, 4, 5, 7, 8, 9,
and 16. The number of non-trivial real multiplications and additions required
for each of these small point DFTs is given in Table 3-2.

TABLE 3-2.  PFA SHORT DFT REAL OPERATIONS REQUIREMENTS

     N      $\mu_i$    $\alpha_i$
     2         0           4
     3         4          12
     4         0          16
     5        10          34
     7        16          72
     8         4          52
     9        20          84
    16        20         148

A  listing  of  the  algorithms  is  given  in  Appendix  B. 
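
The row/column totals for the SWIFT and PFA algorithms follow directly from their short-DFT tables. The sketch below is illustrative: the dictionaries transcribe the (real multiplications, real additions) pairs of Tables 3-1 and 3-2, and the function name `row_column_counts` is hypothetical. It uses `math.prod`, which requires Python 3.8 or later.

```python
from math import gcd, prod

# (real multiplications, real additions) per short DFT, from Tables 3-1 and 3-2
SWIFT_SHORT = {2: (0, 4), 3: (4, 12), 4: (0, 16), 5: (16, 32),
               7: (36, 60), 8: (4, 52), 9: (44, 88), 16: (24, 144)}
PFA_SHORT = {2: (0, 4), 3: (4, 12), 4: (0, 16), 5: (10, 34),
             7: (16, 72), 8: (4, 52), 9: (20, 84), 16: (20, 148)}

def row_column_counts(factors, table):
    """N * sum(mu_i / N_i) and N * sum(alpha_i / N_i) for mutually prime factors."""
    for i, a in enumerate(factors):
        for b in factors[i + 1:]:
            if gcd(a, b) != 1:
                raise ValueError("factors must be mutually prime")
    N = prod(factors)
    mults = sum((N // Ni) * table[Ni][0] for Ni in factors)
    adds = sum((N // Ni) * table[Ni][1] for Ni in factors)
    return mults, adds

swift_15 = row_column_counts((3, 5), SWIFT_SHORT)   # (68, 156)
pfa_15 = row_column_counts((3, 5), PFA_SHORT)       # (50, 162)
```

For N = 15 the PFA short modules trade six extra real additions for eighteen fewer real multiplications relative to the SWIFT modules.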

The WFTA [12-17] was first published by Dr. Samuel Winograd in the
mid-seventies. Like the SWIFT and PFA, the WFTA is based on a mutually prime
factorization of N resulting in no twiddle factors. However, the WFTA offers
an alternative to the row/column evaluations of Equation (42) used in the
SWIFT and PFA. The WFTA uses the special structure of the WFTA short DFT
transforms to nest all the multiplications inside of input and output
additions. The number of real multiplications required of an N-point WFTA
algorithm with r mutually prime factors is

$$\mathrm{NRMULT} = 2 \prod_{i=1}^{r} \delta_i - 2 \prod_{i=1}^{r} \beta_i , \tag{72}$$

where

$\delta_i$ = the number of complex multiplications in the $N_i$-point DFT

$\beta_i$ = the number of multiplications by "1" in the $N_i$-point DFT.

The number of real additions required [10] for two, three, and four factors is
expressed in Equations (73), (74), and (75), respectively.

$$\mathrm{NRADDS} = 2N_1\gamma_2 + 2\delta_2\gamma_1 \tag{73}$$

$$\mathrm{NRADDS} = 2N_1 N_2 \gamma_3 + 2\delta_3 [N_1\gamma_2 + \delta_2\gamma_1] \tag{74}$$

$$\mathrm{NRADDS} = 2N_1 N_2 N_3 \gamma_4 + 2\delta_4 [N_1 N_2 \gamma_3 + \delta_3 [N_1\gamma_2 + \delta_2\gamma_1]] , \tag{75}$$

where $\gamma_i$ = the number of complex additions in the $N_i$-point DFT. The
WFTA also uses efficient small point DFTs of lengths 2, 3, 4, 5, 7, 8, 9, and
16. The total number of complex multiplications, the number of multiplications
by "1," and the number of complex additions required for each of these small
point DFTs is given in Table 3-3.
TABLE 3-3.  WFTA SHORT DFT COMPLEX OPERATIONS REQUIREMENTS

     N    $\delta_i$    $\beta_i$    $\gamma_i$
     2        2            2            2
     3        3            1            6
     4        4            3            8
     5        6            1           17
     7        9            1           36
     8        8            4           26
     9       11            1           44
    16       18            5           74

The  ordering  of  the  factors  of  N  can  affect  the  number  of  real  additions 
required  by  the  WFTA.  In  this  report  the  optimum  ordering  of  the  factors  of  N 
was  always  used  to  calculate  WFTA  real  addition  requirements.  This  optimum 
ordering  is  shown  in  Tables  4-7  and  4-8.  A  listing  of  the  WFTA  small  point 
algorithms  is  given  in  Appendix  A. 
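
The effect of factor ordering on the WFTA addition count can be explored with a short script. This is a sketch under stated assumptions: the dictionary transcribes Table 3-3 as (complex multiplications, multiplications by "1", complex additions), the addition count follows the nested pattern of Equations (73) through (75) extended to any number of factors, and the function names are hypothetical.

```python
from itertools import permutations

# N: (complex multiplications, multiplications by "1", complex additions)
WFTA_SHORT = {2: (2, 2, 2), 3: (3, 1, 6), 4: (4, 3, 8), 5: (6, 1, 17),
              7: (9, 1, 36), 8: (8, 4, 26), 9: (11, 1, 44), 16: (18, 5, 74)}

def wfta_real_mults(factors):
    """Equation (72): twice (product of multiplications minus product of ones)."""
    total, ones = 1, 1
    for Ni in factors:
        total *= WFTA_SHORT[Ni][0]
        ones *= WFTA_SHORT[Ni][1]
    return 2 * (total - ones)

def wfta_real_adds(factors):
    """Nested form of Equations (73)-(75) for factors ordered N1, N2, ..., Nr."""
    complex_adds = WFTA_SHORT[factors[0]][2]   # additions of the N1-point DFT
    size = factors[0]                          # running product N1*N2*...
    for Ni in factors[1:]:
        mults_i, _, adds_i = WFTA_SHORT[Ni]
        complex_adds = size * adds_i + mults_i * complex_adds
        size *= Ni
    return 2 * complex_adds

def best_ordering(factors):
    """Ordering of the factors that minimizes the real addition count."""
    return min(permutations(factors), key=wfta_real_adds)
```

For N = 15, the ordering (3, 5) gives 174 real additions while (5, 3) gives 162; the multiplication count of 34 does not depend on the ordering.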

IV.  COMPARISON  OF  ALGORITHM  ARITHMETIC  REQUIREMENTS 

The arithmetic requirements of the various one-dimensional DFT algorithms
are given in this chapter for lengths N = 2 to N = 5040. In addition, the
requirements for various two-dimensional DFT algorithms are given for sizes
ranging from 2x2 to 90x90. The one-dimensional requirements for the DFT,
radix-2 FFT, radix-4 FFT, MFFT, SWIFT, WFTA, and PFA algorithms are compared
in Tables 4-1 through 4-8. The two-dimensional requirements for the custom
DFT, DFT, radix-2 FFT, MFFT, SWIFT, WFTA, and PFA algorithms are compared in
Tables 4-9 through 4-12. In addition, Table 4-13 summarizes the chapter by
listing the number of current and future chips required for various one- and
two-dimensional transforms. Tables 4-1 through 4-13 are located at the end of
this chapter.

Tables 4-1 and 4-2 give the total number of real operations (i.e., the sum
of the real multiplications and real additions) required for the
one-dimensional DFT algorithms of N = 2 through N = 5040. Measured by the
number of real operations, the DFT is by far the least efficient algorithm.
For example, the DFT requires 54,600% of the operations required of the
radix-2 FFT for N = 4096. A radix-4 FFT requires 85% of the real operations
required by the radix-2 FFT. The MFFT requires fewer real operations than the
DFT and the two FFTs. In addition, the MFFT can be used for every sequence
length listed between N = 2 and N = 5040. However, the MFFT usually requires
about 10% more operations than the PFA and WFTA. Generally, the PFA and WFTA
are the most efficient algorithms for the lengths between N = 20 and N = 5040.
The number of real operations required for the PFA, WFTA, and SWIFT algorithms
are within 10% of the number required by the best algorithm for 100%, 96%, and
44% of the lengths in this range, respectively.

Generally, the multiplication operation requires more time and hardware
resources than the addition operation. This is also true at the chip level
where a multiplier requires approximately four times the silicon "real estate"
of an adder. Accordingly, a weighted index of arithmetic complexity is
particularly important if a custom chip can be designed to match the
requirements of an algorithm. The weighted unit shown in Tables 4-3 and 4-4 is
the Total Equivalent Real Multiplications (TERM). The TERM unit is simply the
number of required real multiplications plus one-fourth the number of required
real additions. As with the total number of real operations, the TERM count
shows the DFT to be by far the least efficient algorithm. For example, the
DFT requires from 700% to 62,000% of the TERM of comparable radix-2 FFTs. A
radix-4 FFT requires about 80% of the TERM required by the radix-2 FFT. As
before, the TERM index shows the MFFT and SWIFT algorithms to be more
efficient than the DFT and FFT algorithms. However, for many of the lengths,
the MFFT and SWIFT algorithms require up to 200% and 150%, respectively, of
the TERM required of the WFTA algorithm. For lengths between N = 20 and
N = 5040, the WFTA is the most efficient algorithm for 93% of the lengths with
the PFA being the most efficient algorithm for the other 7%. The required TERM
for the WFTA, PFA, SWIFT, and MFFT algorithms are within 10% of the number
required by the best algorithm for 100%, 58%, 2%, and 0% of the lengths in
this range, respectively.
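
The quoted DFT-to-radix-2 ratios can be reproduced from the operation counts given earlier. The sketch below rests on an assumption that is not stated explicitly in the tables' description: the direct DFT is charged N-squared complex multiplications (four real multiplications and two real additions each) plus N(N-1) complex additions (two real additions each), as in the brute-force count of Section II. The function names are hypothetical.

```python
def dft_counts(N):
    """(real multiplications, real additions) of the direct DFT (assumed model)."""
    real_mults = 4 * N * N
    real_adds = 2 * N * N + 2 * N * (N - 1)
    return real_mults, real_adds

def radix2_counts(N):
    """(2N log2 N, 3N log2 N) for N a power of 2."""
    lg = N.bit_length() - 1
    if N < 2 or 1 << lg != N:
        raise ValueError("N must be a power of 2")
    return 2 * N * lg, 3 * N * lg

def term(real_mults, real_adds):
    """Total Equivalent Real Multiplications: multiplications + additions / 4."""
    return real_mults + real_adds / 4

N = 4096
dft_m, dft_a = dft_counts(N)
fft_m, fft_a = radix2_counts(N)
total_ops_ratio = (dft_m + dft_a) / (fft_m + fft_a)   # about 546
term_ratio = term(dft_m, dft_a) / term(fft_m, fft_a)  # about 621
```

Under this model the N = 4096 ratios come out near 546 and 621, consistent with the 54,600% figure for total operations and the 62,000% upper figure for TERM.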

Tables 4-5 and 4-6 give the number of real multiplications required for
one-dimensional DFT algorithms of lengths N = 2 through N = 5040. Once again,
the DFT is by far the least efficient algorithm in terms of real
multiplications. For example, the DFT requires 68,300% of the real
multiplications needed for the radix-2 FFT for N = 4096. A radix-4 FFT
requires 75% of the real multiplications required by the radix-2 FFT. The MFFT
offers considerable savings in the number of multiplications required compared
to the DFT and the FFT algorithms. However, the MFFT is never within 10% of
the arithmetic requirement of the most efficient algorithms for lengths
greater than N = 4. The WFTA is superior at minimizing the number of required
multiplications. The WFTA is the most efficient algorithm in terms of real
multiplications for 93% of the lengths between N = 20 and N = 5040. The PFA
is the most efficient algorithm for the other 7% of the lengths. The
percentages of the lengths in this range at which the MFFT, SWIFT, WFTA, and
PFA algorithms are within 10% of the most efficient algorithms are 0%, 2%,
100%, and 9%, respectively.

Tables 4-7 and 4-8 give the number of real additions required for
one-dimensional DFT algorithms of lengths N = 2 through N = 5040. In terms of
real additions, the DFT is the least efficient algorithm followed in order of
increasing efficiency by the radix-2 FFT and the radix-4 FFT. The radix-4 FFT
requires 92% of the real additions required of the radix-2 FFT. The SWIFT
algorithm is at least as efficient as any other algorithm for 98% of the
lengths N = 20 to N = 5040. The percentages of the lengths in this range at
which the SWIFT, PFA, WFTA, and MFFT algorithms are within 10% of the most
efficient algorithms are 98%, 76%, 33%, and 20%, respectively.

The thrust of this report has been one-dimensional DFT algorithms. However,
two-dimensional DFT algorithms can be easily implemented with one-dimensional
algorithms using the row/column technique. Using this procedure,
one-dimensional DFTs are performed on all the rows, followed by
one-dimensional transforms performed on all the columns of data resulting from
the row transforms. Thus, the arithmetic requirements of a row/column
implementation of an NxN DFT algorithm are simply 2N times the requirements of
the selected N-point one-dimensional DFT algorithm.
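
The row/column technique described above can be sketched as follows. This is
an illustrative sketch only: the function names `dft_1d`, `dft_2d_row_column`,
and `dft_2d_direct` are hypothetical, and the direct DFT of Equation (1)
stands in for whichever N-point one-dimensional algorithm is selected.

```python
import cmath

def dft_1d(x):
    """Direct N-point DFT of Equation (1): A(k) = sum_n x(n) exp(-j 2 pi n k / N)."""
    n_pts = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / n_pts)
                for n in range(n_pts))
            for k in range(n_pts)]

def dft_2d_row_column(a, transform=dft_1d):
    """N x N DFT by the row/column technique: N row transforms, then N column transforms."""
    rows = [transform(list(row)) for row in a]           # N one-dimensional transforms
    cols = [transform(list(col)) for col in zip(*rows)]  # N more, on the resulting columns
    return [list(r) for r in zip(*cols)]                 # transpose back to row-major order

def dft_2d_direct(a):
    """Reference two-dimensional DFT evaluated straight from its double sum."""
    n = len(a)
    return [[sum(a[r][c] * cmath.exp(-2j * cmath.pi * (r * k1 + c * k2) / n)
                 for r in range(n) for c in range(n))
             for k2 in range(n)]
            for k1 in range(n)]
```

Because `dft_2d_row_column` calls the one-dimensional transform exactly 2N
times, its arithmetic cost is 2N times that of the chosen N-point algorithm,
whatever that algorithm is.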

True two-dimensional FFT algorithms have been developed which do not rely
on one-dimensional transforms [18]. These algorithms generally require fewer
complex multiplications than the row/column methods. However, they are harder
to implement and are less universal than the one-dimensional algorithms.
Algorithms have also been developed for computing the two-dimensional DFT of
arrays whose elements do not have rectangular spacing [19]. Refinement and
extension of this work is very important to radar digital beam forming
efforts, as most phased array antennas have triangularly spaced elements.
Although important, an in-depth examination of these algorithms is beyond the
scope of this report.

The arithmetic requirements of a one-dimensional DFT are the same whether
the coefficients are the standard ones of Equation (1) or those selected for
individual custom responses. However, if the row/column method is selected
for the two-dimensional DFT, only standard coefficients can be used. The
custom shaping of the response of each transform output point afforded by the
custom two-dimensional DFT requires a weighted sum of all the elements in the
NxN data array. Each transform output point can have a unique NxN array of
coefficients exhibiting none of the symmetrical properties of the standard DFT
coefficients. Computing each custom DFT output point requires 4N^2 real
multiplications and 4N^2-2 real additions. Computing all N^2 of the transform
outputs therefore requires 4N^4 real multiplications, 4N^4-2N^2 real
additions, and 8N^4-2N^2 total real operations.
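
The custom DFT counts above follow from treating each output point as a
weighted sum of N^2 complex samples. The sketch below works through that
arithmetic. It assumes a complex multiplication costs four real
multiplications and two real additions, and a complex addition costs two real
additions; the function name `custom_2d_dft_ops` is hypothetical.

```python
def custom_2d_dft_ops(n, outputs=None):
    """Real-operation counts for `outputs` custom output points of an n x n array.

    Returns (real multiplications, real additions, total real operations).
    By default all n*n output points are computed.
    """
    if outputs is None:
        outputs = n * n
    complex_mults = n * n        # one weight per array element
    complex_adds = n * n - 1     # summing n*n products
    mults_per_point = 4 * complex_mults                    # 4N^2
    adds_per_point = 2 * complex_mults + 2 * complex_adds  # 4N^2 - 2
    mults = outputs * mults_per_point
    adds = outputs * adds_per_point
    return mults, adds, mults + adds
```

Under these assumptions `custom_2d_dft_ops(30)` and `custom_2d_dft_ops(90)`
give totals of 6,478,200 and 524,863,800 real operations, and
`custom_2d_dft_ops(8, outputs=4)` gives 2040, in agreement with the figures
quoted in this section.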

The  number  of  total  real  operations,  TERM,  real  multiplications,  and  real 
additions  for  the  custom  DFT,  DFT,  radix-2  FFT,  MFFT,  SWIFT,  WFTA,  and  PFA 
two-dimensional  algorithms  for  array  sizes  2x2  through  90x90  are  shown  in 
Tables  4-9  through  4-12.  The  arithmetic  requirements  shown  in  the  tables  are 
for  the  row/column  method  except  for  the  custom  DFT. 

As the tables indicate, the arithmetic requirements for the
two-dimensional custom DFT are enormous. However, if the number of desired
customized transform output points is a small percentage of N^2, this
algorithm can be useful. For example, if only four customized transform
output points were required from an 8x8 data array, 2040 real operations would
be required. To get four non-customized transform points from the radix-2 FFT
would require the 1920 real operations needed to compute all the output
points. The relative efficiency of the row/column algorithms is the same as
that of the one-dimensional algorithms, as the two-dimensional requirements
are simply 2N times the one-dimensional requirements discussed earlier. As in
the one-dimensional case, there are considerable differences in arithmetic
complexity among the two-dimensional algorithms. For example, the 30x30
custom DFT, DFT, and WFTA algorithms require 6,478,200 real operations,
428,400 real operations, and 27,120 real operations, respectively. The
differences are even greater for the larger arrays. For example, the 90x90
custom DFT, DFT, and PFA algorithms require 524,863,800 real operations,
11,631,600 real operations, and 369,360 real operations, respectively.
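
The 2N scaling can be checked against the row/column figures quoted above.
The sketch below is illustrative only. Its counting conventions are
assumptions chosen because they reproduce those figures: a direct N-point DFT
uses N^2 complex multiplications and N(N-1) complex additions, and a radix-2
FFT uses (N/2)log2(N) complex multiplications (trivial ones included) and
N log2(N) complex additions. The function names are hypothetical.

```python
def dft_real_ops(n):
    """Total real operations of a direct n-point DFT."""
    mults = 4 * n * n                       # 4 real multiplies per complex multiply
    adds = 2 * n * n + 2 * n * (n - 1)      # 2 per complex multiply + 2 per complex add
    return mults + adds

def radix2_fft_real_ops(n):
    """Total real operations of a radix-2 FFT, n a power of two."""
    stages = n.bit_length() - 1             # log2(n)
    complex_mults = (n // 2) * stages
    complex_adds = n * stages
    return 4 * complex_mults + 2 * complex_mults + 2 * complex_adds

def row_column_ops(n, one_dim_ops):
    """Real operations of an n x n row/column transform: 2n one-dimensional transforms."""
    return 2 * n * one_dim_ops(n)
```

With these conventions, `row_column_ops(30, dft_real_ops)` is 428,400,
`row_column_ops(90, dft_real_ops)` is 11,631,600, and
`row_column_ops(8, radix2_fft_real_ops)` is 1920.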

It  is  difficult  to  project  the  exact  hardware  size  and  cost  for  the 
various  algorithms  based  solely  on  their  arithmetic  requirements.  An  analysis 
of  the  memory  requirements,  software  complexity,  optimum  architectures,  and 
availability  of  special  purpose  integrated  circuits  for  each  algorithm  and 
array  size  is  beyond  the  scope  of  this  report.  However,  a  brief  review  of 
present  and  near  term  arithmetic  capabilities  of  digital  integrated  circuits 
will  give  insight  into  the  feasibility  of  implementing  the  various  algorithms 
for different array sizes.

Currently, TRW offers 8-bit and 16-bit multiplier/accumulator (MAC) chips
which  provide  real  multiplication  and  addition  rates  of  14  MHz  and  9  MHz, 
respectively.  These  two  TRW  chips  are  packaged  in  dual-in-line  packages  with 
pin  counts  of  48  pins  and  64  pins.  Depending  on  the  algorithm  and  required 
operating  speeds,  these  chips  can  be  multiplexed  to  reduce  the  total  chip 
count. 

Dramatic integrated circuit performance increases are expected in the near
future as a result of the Department of Defense's Very High Speed Integrated
Circuit (VHSIC) program. The $325 million, seven-year program, which began in
March of 1980, was designed to provide a fifty-fold improvement in high speed,
high throughput signal and data processing integrated circuits. By the end of
phase I of the program in mid-1984, six contractors will provide a pilot line
production of chips with 1.25 micron architectural features, minimum
throughput rates of 25 MHz, and a minimum functional throughput rate (FTR) of
5x10^11 gate-Hz/cm^2. The pilot line production of chips with 0.5 to 0.8
micron architectural features, minimum throughput rates of 100 MHz, and a
minimum FTR of 10^13 gate-Hz/cm^2 will be required by the completion of phase
II of the program in 1987 [20].

Several phase I VHSIC contractors will produce MAC chips. Preliminary
reports indicate that IBM will produce a complex multiplier/accumulator (CMAC)
chip. This implies a one-chip capability of performing a simultaneous set of
approximately eight real operations (i.e., four real multiplications and four
real additions) at a 25 MHz rate [21]. Westinghouse, another VHSIC contractor,
plans to build a complex number arithmetic vector processor capable of
performing 40 million complex number operations/sec, which would only require
two 6x8 in. printed circuit boards. In addition, Westinghouse is designing a
ten-board array type processor capable of performing 200 million complex
number operations/sec, or more than one billion real number operations/sec
[22]. In addition to the VHSIC program, commercial very large scale
integration (VLSI) chips produced with VHSIC technology are expected to
provide VHSIC-like arithmetic capabilities.

A convenient way to compare the chip capabilities and algorithm
requirements is to use the units: (1) millions of real multiplications/sec
(MMPS), (2) millions of real additions/sec (MAPS), and (3) millions of total
equivalent real multiplications/sec (TERMS). For example, the 16-bit TRW MAC
chip is capable of 9 MMPS and 9 MAPS. The IBM VHSIC CMAC chip will offer
roughly an eleven-fold improvement at 100 MMPS and 100 MAPS when developed. A
hypothetical custom VHSIC/VLSI chip with at least 125 TERMS of arithmetic
capability should be available by 1984. For the comparisons in this report,
the time required to perform the transform will be arbitrarily assumed to be
1 μsec. This choice of time makes the number of real multiplications,
additions, and TERM found in the tables equal to the number of MMPS, MAPS, and
TERMS, respectively. For example, computing a 64-point DFT in 1 μsec requires
16,384 MMPS, 16,256 MAPS, and 20,448 TERMS. As the MMPS requirement of the
64-point DFT is more demanding than the MAPS requirement, the former dictates
the use of 1,821 TRW MACs or 164 IBM CMACs. The TERMS numbers predict that
164 custom VHSIC/VLSI chips would be required.
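
The chip counts in the 64-point example can be reproduced with a short
calculation. The sketch below is illustrative. It assumes a MAC or CMAC chip
multiplies and adds concurrently, so the more demanding of the MMPS and MAPS
requirements sets the chip count, and it takes the 20,448 TERMS figure from
the text rather than deriving it. The helper names are hypothetical.

```python
def chips_needed(required_rate, chip_rate):
    """Smallest whole number of chips whose combined rate meets the requirement."""
    return -(-required_rate // chip_rate)    # integer ceiling division

def mac_chips(mmps_needed, maps_needed, chip_mmps, chip_maps):
    """Chip count when each chip multiplies and adds concurrently."""
    return max(chips_needed(mmps_needed, chip_mmps),
               chips_needed(maps_needed, chip_maps))

n = 64
mmps = 4 * n * n             # 16,384 MMPS for the 64-point DFT
maps_ = 4 * n * n - 2 * n    # 16,256 MAPS for the 64-point DFT
terms = 20448                # TERMS figure quoted in the text

trw = mac_chips(mmps, maps_, 9, 9)          # 16-bit TRW MAC: 9 MMPS, 9 MAPS
ibm = mac_chips(mmps, maps_, 100, 100)      # IBM VHSIC CMAC: 100 MMPS, 100 MAPS
custom = chips_needed(terms, 125)           # custom VHSIC/VLSI chip: 125 TERMS
print(trw, ibm, custom)
```

Rounding up is used because a fractional chip cannot be built; the three
results match the 1,821, 164, and 164 chips given above.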

Using the assumptions and methodology of the previous paragraph, Table
4-13 was constructed to estimate the relative number of TRW, IBM VHSIC, and
custom VHSIC/VLSI chips required to meet the arithmetic requirements of
various one- and two-dimensional DFTs, radix-2 FFTs, and WFTAs. The chip
count does not include non-arithmetic chips necessary for implementation, such
as control and memory chips. A rough estimate of the number of arithmetic
chips required for a different transform time can be found simply by scaling
the chip count in the table by the ratio of 1 μsec to the desired transform
time. This table summarizes the relative differences of complexity among the
various algorithms and the suitability of current and proposed hardware. For
example, the DFT shown in the table requires more arithmetic chips than any
other algorithm except the custom two-dimensional DFT. As shown in the table,
the custom VHSIC/VLSI chip offers no significant reductions in the DFT chip
count. This illustrates that the TRW and IBM chips are well suited to the
DFT. In contrast, the WFTA requires fewer arithmetic chips, although it is
not particularly well suited to the TRW and IBM chips. Roughly a three-fold
improvement is gained using custom VHSIC/VLSI chips tailored to the WFTA's
required multiplication to addition ratio. The radix-2 FFT compares
surprisingly well with the WFTA when implemented with the MAC and CMAC chips
of TRW and IBM. The extra arithmetic chips required by the radix-2 FFT would
probably be offset by the extra control and memory chips required by the more
structurally complex WFTA algorithm. However, if custom chips are available,
the radix-2 FFT would generally require 200% to 300% of the arithmetic chips
required by the WFTA. The PFA and radix-4 FFT algorithms are not shown in the
table as they are very close to the numbers given for the WFTA and radix-2
algorithms, respectively. Likewise, the MFFT and SWIFT algorithms are not
shown as they reside between
the radix-2 FFT and the WFTA in performance.

Appendix A.  WFTA SHORT DFT ALGORITHMS

Algorithms are given to compute the DFT for lengths 2, 3, 4, 5, 7, 8, 9,
and 16. These algorithms were taken from [23], but have been credited to the
work of Rader and Winograd [23,13]. The complex input data x(1), x(2),...,
x(N) and complex output data X(1), X(2),..., X(N) are in natural order. The
complex values M1, M2,..., MM are the results of the M complex multiplications
required for the small point transform. The complex values T1, T2,... and
S1, S2,... are temporary values derived from the input data and intermediate
results, respectively. Generally, the operations must be performed in the
order listed. The total number of trivial and non-trivial complex
multiplications and additions required for each DFT is listed with the
algorithm. In addition, the number of complex multiplications by W^0 or "1"
is given in parentheses.

(1)  N=2;  2 complex multiplications (2), 2 complex additions.

M1=1*(x(1)+x(2))
M2=1*(x(1)-x(2))
X(1)=M1
X(2)=M2

(2)  N=3;  3 complex multiplications (1), 6 complex additions, u=2π/3.

Coefficients:  C1=-3/2
               C2=j sin u

T1=x(2)+x(3)
M1=1*(x(1)+T1)
M2=C1*T1
M3=C2*(x(3)-x(2))
S1=M1+M2
X(1)=M1
X(2)=S1+M3
X(3)=S1-M3
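
As a check on the notation, the 3-point listing can be transcribed directly
and compared with Equation (1). The sketch below is illustrative. It assumes
u = 2π/3 and maps x(1), x(2), x(3) onto zero-based list entries; the names
`wfta3` and `direct_dft` are hypothetical.

```python
import cmath
import math

def direct_dft(x):
    """Direct DFT of Equation (1), used only as a reference."""
    n_pts = len(x)
    return [sum(x[n] * cmath.exp(-2j * math.pi * n * k / n_pts)
                for n in range(n_pts))
            for k in range(n_pts)]

def wfta3(x):
    """3-point short DFT; x[0], x[1], x[2] play the roles of x(1), x(2), x(3)."""
    u = 2 * math.pi / 3
    c1 = -1.5                  # C1 = -3/2
    c2 = 1j * math.sin(u)      # C2 = j sin u
    t1 = x[1] + x[2]
    m1 = x[0] + t1             # trivial multiplication by 1
    m2 = c1 * t1
    m3 = c2 * (x[2] - x[1])
    s1 = m1 + m2
    return [m1, s1 + m3, s1 - m3]
```

Since C1 is purely real and C2 purely imaginary, each of the two non-trivial
complex multiplications costs only two real multiplications, which is
consistent with the four real multiplications listed for N=3 in Appendix B.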

(3)  N=4;  4  complex  multiplications  (3),  8  complex  additions. 

T1=x(1)+x(3)
T2=x(2)+x(4)
M1=1*(T1+T2)
M2=1*(T1-T2)
M3=1*(x(1)-x(3))
M4=j*(x(4)-x(2))
X(1)=M1
X(2)=M3+M4
X(3)=M2
X(4)=M3-M4

(4)  N=5;  6 complex multiplications (1), 17 complex additions, u=2π/5.

Coefficients:  C1=-5/4
               C2=(cos u-cos 2u)/2
               C3=-j sin u
               C4=-j(sin u+sin 2u)
               C5=j(sin u-sin 2u)

T1=x(2)+x(5)
T2=x(3)+x(4)
T3=x(2)-x(5)
T4=x(4)-x(3)
T5=T1+T2
M1=1*(x(1)+T5)
M2=C1*T5
M3=C2*(T1-T2)
M4=C3*(T3+T4)
M5=C4*T4
M6=C5*T3
S1=M1+M2
S2=S1+M3
S3=M4-M5
S4=S1-M3
S5=M4+M6
X(1)=M1
X(2)=S2+S3
X(3)=S4+S5
X(4)=S4-S5
X(5)=S2-S3


(5)  N=7;  9 complex multiplications (1), 36 complex additions, u=2π/7.

Coefficients:  C1=-7/6
               C2=(2cos u-cos 2u-cos 3u)/3
               C3=(cos u-2cos 2u+cos 3u)/3
               C4=(cos u+cos 2u-2cos 3u)/3
               C5=-j(sin u+sin 2u-sin 3u)/3
               C6=j(2sin u-sin 2u+sin 3u)/3
               C7=j(sin u-2sin 2u-sin 3u)/3
               C8=j(sin u+sin 2u+2sin 3u)/3

T1=x(2)+x(7)
T2=x(3)+x(6)
T3=x(4)+x(5)
T4=T1+T2+T3
T5=x(2)-x(7)
T6=x(3)-x(6)
T7=x(5)-x(4)
T8=T1-T3
T9=T3-T2
T10=T5+T6+T7
T11=T7-T5
T12=T6-T7
T13=-T8-T9
T14=-T11-T12
M1=1*(x(1)+T4)
M2=C1*T4
M3=C2*T8
M4=C3*T9
M5=C4*T13
M6=C5*T10
M7=C6*T11
M8=C7*T12
M9=C8*T14
S1=-M3-M4
S2=-M3-M5
S3=-M7-M8
S4=M7+M9
S5=M1+M2
S6=S5-S1
S7=S5+S2
S8=S5+S1-S2
S9=M6-S3
S10=M6-S4
S11=M6+S3+S4
X(1)=M1
X(2)=S6+S9
X(3)=S7+S10
X(4)=S8-S11
X(5)=S8+S11
X(6)=S7-S10
X(7)=S6-S9

(6)  N=8;  8 complex multiplications (4), 26 complex additions, u=2π/8.

Coefficients:  C1=cos u
               C2=-j sin u

T1=x(1)+x(5)
T2=x(3)+x(7)
T3=x(2)+x(6)
T4=x(2)-x(6)
T5=x(4)+x(8)
T6=x(4)-x(8)
T7=T1+T2
T8=T3+T5
M1=1*(T7+T8)
M2=1*(T7-T8)
M3=1*(T1-T2)
M4=1*(x(1)-x(5))
M5=C1*(T4-T6)
M6=j*(T5-T3)
M7=j*(x(7)-x(3))
M8=C2*(T4+T6)
S1=M4+M5
S2=M4-M5
S3=M7+M8
S4=M7-M8
X(1)=M1
X(2)=S1+S3
X(3)=M3+M6
X(4)=S2-S4
X(5)=M2
X(6)=S2+S4
X(7)=M3-M6
X(8)=S1-S3

(7)  N=9;  11 complex multiplications (1), 44 complex additions, u=2π/9.

Coefficients:  C1=3/2
               C2=-1/2
               C3=cos u
               C4=-cos 4u
               C5=-cos 2u
               C6=-j sin 3u
               C7=j sin u
               C8=j sin 4u
               C9=j sin 2u

T1=x(2)+x(9)
T2=x(3)+x(8)
T3=x(4)+x(7)
T4=x(5)+x(6)
T5=T1+T2+T4
T6=x(2)-x(9)
T7=x(8)-x(3)
T8=x(4)-x(7)
T9=x(5)-x(6)
T10=T6+T7+T9
T11=T1-T2
T12=T2-T4
T13=T7-T6
T14=T7-T9
T15=-T12-T11
T16=-T13+T14
M1=1*(x(1)+T3+T5)
M2=C1*T3
M3=C2*T5
M4=C3*T11
M5=C4*T12
M6=C5*T15
M7=C6*T10
M8=C6*T8
M9=C7*T13
M10=C8*T14
M11=C9*T16
S1=-M4-M5
S2=M6-M5
S3=-M9-M10
S4=M10-M11
S5=M1+M3+M3
S6=S5-M2
S7=S5+M3
S8=S6-S1
S9=S2+S6
S10=S1-S2+S6
S11=M8-S3
S12=M8-S4
S13=M8+S3+S4
X(1)=M1
X(2)=S8+S11
X(3)=S9-S12
X(4)=S7+M7
X(5)=S10+S13
X(6)=S10-S13
X(7)=S7-M7
X(8)=S9+S12
X(9)=S8-S11

(8)  N=16;  18 complex multiplications (5), 74 complex additions, u=2π/16.

Coefficients:  C1=cos 2u
               C2=cos 3u
               C3=cos u+cos 3u
               C4=cos 3u-cos u
               C5=-j sin 2u
               C6=-j sin 3u
               C7=j(sin 3u-sin u)
               C8=-j(sin u+sin 3u)

T1=x(1)+x(9)
T2=x(5)+x(13)
T3=x(3)+x(11)
T4=x(3)-x(11)
T5=x(7)+x(15)
T6=x(7)-x(15)
T7=x(2)+x(10)
T8=x(2)-x(10)
T9=x(4)+x(12)
T10=x(4)-x(12)
T11=x(6)+x(14)
T12=x(6)-x(14)
T13=x(8)+x(16)
T14=x(8)-x(16)
T15=T1+T2
T16=T3+T5
T17=T15+T16
T18=T7+T11
T19=T7-T11
T20=T9+T13
T21=T9-T13
T22=T18+T20
T23=T8+T14
T24=T8-T14
T25=T10+T12
T26=T12-T10
M1=1*(T17+T22)
M2=1*(T17-T22)
M3=1*(T15-T16)
M4=1*(T1-T2)
M5=1*(x(1)-x(9))
M6=C1*(T19-T21)
M7=C1*(T4-T6)
M8=C2*(T24+T26)
M9=C3*T24
M10=C4*T26
M11=j*(T20-T18)
M12=j*(T5-T3)
M13=j*(x(13)-x(5))
M14=C5*(T19+T21)
M15=C5*(T4+T6)
M16=C6*(T23+T25)
M17=C7*T23
M18=C8*T25
S1=M4+M6
S2=M4-M6
S3=M12+M14
S4=M14-M12
S5=M5+M7
S6=M5-M7
S7=M9-M8
S8=M10-M8
S9=S5+S7
S10=S5-S7
S11=S6+S8
S12=S6-S8
S13=M13+M15
S14=M13-M15
S15=M16+M17
S16=M16-M18
S17=S13+S15
S18=S13-S15
S19=S14+S16
S20=S14-S16
X(1)=M1
X(2)=S9+S17
X(3)=S1+S3
X(4)=S12-S20
X(5)=M3+M11
X(6)=S11+S19
X(7)=S2+S4
X(8)=S10-S18
X(9)=M2
X(10)=S10+S18
X(11)=S2-S4
X(12)=S11-S19
X(13)=M3-M11
X(14)=S12+S20
X(15)=S1-S3
X(16)=S9-S17


Appendix  B.  PFA  SHORT  DFT  ALGORITHMS 

The following algorithms compute the DFT for lengths 2, 3, 4, 5, 7, 8, 9,
and 16. These algorithms were taken from Burrus and Eschenbacher [11]. They
are part of a complete Fortran listing of a general purpose PFA program. In
contrast to Appendix A, these algorithms are written in terms of real
multiplications and additions. In addition, no trivial multiplications are
used in these algorithms. The real and imaginary parts of the complex input
data are represented in natural order by XR(1), XR(2),..., XR(N) and XI(1),
XI(2),..., XI(N), respectively. The complex output is stored in natural order
in the XR(I) and XI(I) arrays. The values U1, U2,..., T1, T2,..., R1, R2,...,
and S1, S2,... are all temporary values derived from input data and
intermediate results. Generally, the operations must be performed in the
order listed. The total number of real multiplications and additions required
for each DFT is listed with each algorithm.

(1)  N=2;  0  real  multiplications,  4  real  additions 

T1=XR(1) 

XR(1)=T1+XR(2) 

XR(2)=T1-XR(2) 

T1=XI(1) 

XI(1)=T1+XI(2) 

XI(2)=T1-XI(2) 

(2)  N=3;  4 real multiplications, 12 real additions, u=2π/3.

Coefficients:  C1=sin u
               C2=1/2

T1=(XR(2)-XR(3))*C1 

U1=(XI(2)-XI(3))*C1 

R1=XR(2)+XR(3) 

S1=XI(2)+XI(3) 

T2=XR(1)-R1*C2 

U2=XI(1)-S1*C2 

XR(1)=XR(1)+R1 

XI(1)=XI(1)+S1 

XR(2)=T2+U1

XR(3)=T2-U1 

XI(2)=U2-T1 

XI(3)=U2+T1 
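
The real-arithmetic, in-place style of these listings can be illustrated with
the 3-point case. The sketch below is a direct transcription under the
assumptions u = 2π/3, C1 = sin u, and C2 = 1/2, with XR(1..3) and XI(1..3)
mapped to zero-based lists; the names `pfa3_inplace` and `direct_dft` are
hypothetical.

```python
import cmath
import math

def direct_dft(x):
    """Direct DFT of Equation (1), used only as a reference."""
    n_pts = len(x)
    return [sum(x[n] * cmath.exp(-2j * math.pi * n * k / n_pts)
                for n in range(n_pts))
            for k in range(n_pts)]

def pfa3_inplace(xr, xi):
    """3-point DFT in real arithmetic; xr and xi are overwritten with the output."""
    c1 = math.sin(2 * math.pi / 3)
    c2 = 0.5
    t1 = (xr[1] - xr[2]) * c1
    u1 = (xi[1] - xi[2]) * c1
    r1 = xr[1] + xr[2]
    s1 = xi[1] + xi[2]
    t2 = xr[0] - r1 * c2
    u2 = xi[0] - s1 * c2
    xr[0] = xr[0] + r1
    xi[0] = xi[0] + s1
    xr[1] = t2 + u1
    xr[2] = t2 - u1
    xi[1] = u2 - t1
    xi[2] = u2 + t1
```

The order of the statements matters: XR(1) and XI(1) are read to form T2 and
U2 before they are overwritten with the zero-frequency output.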

(3)  N=4,  0  real  multiplications,  16  real  additions. 

R1=XR(1)+XR(3)
R2=XR(1)-XR(3)
S1=XI(1)+XI(3)
S2=XI(1)-XI(3)
R3=XR(2)+XR(4)
R4=XR(2)-XR(4)
S3=XI(2)+XI(4)
S4=XI(2)-XI(4)
XR(1)=R1+R3
XR(3)=R1-R3
XI(1)=S1+S3
XI(3)=S1-S3
XR(2)=R2+S4
XR(4)=R2-S4
XI(2)=S2-R4
XI(4)=S2+R4

(4)  N=5,  10 real multiplications, 34 real additions, u=2π/5.

Coefficients:  C1=sin u
               C2=sin u+sin 2u
               C3=sin u-sin 2u
               C4=(cos u-cos 2u)/2
               C5=-5/4

R1=XR(2)+XR(5)
R2=XR(2)-XR(5)
S1=XI(2)+XI(5)
S2=XI(2)-XI(5)
R3=XR(3)+XR(4)
R4=XR(3)-XR(4)
S3=XI(3)+XI(4)
S4=XI(3)-XI(4)
T1=(R2+R4)*C1
U1=(S2+S4)*C1
R2=T1-R2*C2
S2=U1-S2*C2
R4=T1-R4*C3
S4=U1-S4*C3
T1=(R1-R3)*C4
U1=(S1-S3)*C4
T2=R1+R3
U2=S1+S3
XR(1)=XR(1)+T2
XI(1)=XI(1)+U2
T2=XR(1)+T2*C5
U2=XI(1)+U2*C5
R1=T2+T1
R3=T2-T1
S1=U2+U1
S3=U2-U1
XR(2)=R1+S4
XR(5)=R1-S4
XI(2)=S1-R4
XI(5)=S1+R4
XR(3)=R3-S2
XR(4)=R3+S2
XI(3)=S3+R2
XI(4)=S3-R2


(5)  N=7,  16 real multiplications, 72 real additions, u=2π/7.

Coefficients:  C1=-7/6
               C2=(2cos u-cos 2u-cos 3u)/3
               C3=(cos u-2cos 2u+cos 3u)/3
               C4=(cos u+cos 2u-2cos 3u)/3
               C5=(sin u+sin 2u-sin 3u)/3
               C6=(2sin u-sin 2u+sin 3u)/3
               C7=(-sin u+2sin 2u+sin 3u)/3
               C8=(sin u+sin 2u+2sin 3u)/3

R1=XR(2)+XR(7)
R2=XR(2)-XR(7)
S1=XI(2)+XI(7)
S2=XI(2)-XI(7)
R3=XR(3)+XR(6)
R4=XR(3)-XR(6)
S3=XI(3)+XI(6)
S4=XI(3)-XI(6)
R5=XR(4)+XR(5)
R6=XR(4)-XR(5)
S5=XI(4)+XI(5)
S6=XI(4)-XI(5)
T1=R1+R3+R5
U1=S1+S3+S5
XR(1)=XR(1)+T1
XI(1)=XI(1)+U1
T1=XR(1)+C1*T1
U1=XI(1)+C1*U1
T2=C2*(R1-R5)
U2=C2*(S1-S5)
T3=C3*(R5-R3)
U3=C3*(S5-S3)
T4=C4*(R3-R1)
U4=C4*(S3-S1)
R1=T1+T2+T3
R3=T1-T2-T4
R5=T1-T3+T4
S1=U1+U2+U3
S3=U1-U2-U4
S5=U1-U3+U4
U1=C5*(S2+S4-S6)
T1=C5*(R2+R4-R6)
T2=C6*(R2+R6)
U2=C6*(S2+S6)
T3=C7*(R4+R6)
U3=C7*(S4+S6)
T4=C8*(R4-R2)
U4=C8*(S4-S2)
R2=T1+T2+T3
R4=T1-T2-T4
R6=T1-T3+T4
S2=U1+U2+U3
S4=U1-U2-U4
S6=U1-U3+U4
XR(2)=R1+S2
XR(7)=R1-S2
XI(2)=S1-R2
XI(7)=S1+R2
XR(3)=R3+S4
XR(6)=R3-S4
XI(3)=S3-R4
XI(6)=S3+R4
XR(4)=R5-S6
XR(5)=R5+S6
XI(4)=S5+R6
XI(5)=S5-R6

(6)  N=8,  4 real multiplications, 52 real additions, u=2π/8.

Coefficients:  C1=sin u

R1=XR(1)+XR(5)
R2=XR(1)-XR(5)
S1=XI(1)+XI(5)
S2=XI(1)-XI(5)
R3=XR(2)+XR(8)
R4=XR(2)-XR(8)
S3=XI(2)+XI(8)
S4=XI(2)-XI(8)
R5=XR(3)+XR(7)
R6=XR(3)-XR(7)
S5=XI(3)+XI(7)
S6=XI(3)-XI(7)
R7=XR(4)+XR(6)
R8=XR(4)-XR(6)
S7=XI(4)+XI(6)
S8=XI(4)-XI(6)
T1=R1+R5
T2=R1-R5
U1=S1+S5
U2=S1-S5
T3=R3+R7
R3=C1*(R3-R7)
U3=S3+S7
S3=C1*(S3-S7)
T4=R4-R8
R4=C1*(R4+R8)
U4=S4-S8
S4=C1*(S4+S8)
T5=R2+R3
T6=R2-R3
U5=S2+S3
U6=S2-S3
T7=R4+R6
T8=R4-R6
U7=S4+S6
U8=S4-S6
XR(1)=T1+T3
XR(5)=T1-T3
XI(1)=U1+U3
XI(5)=U1-U3
XR(2)=T5+U7
XR(8)=T5-U7
XI(2)=U5-T7
XI(8)=U5+T7
XR(3)=T2+U4
XR(7)=T2-U4
XI(3)=U2-T4
XI(7)=U2+T4
XR(4)=T6+U8
XR(6)=T6-U8
XI(4)=U6-T8
XI(6)=U6+T8


(7)  N=9,  20 real multiplications, 84 real additions, u=2π/9.

Coefficients:  C1=sin 3u
               C2=1/2
               C3=-cos 4u
               C4=-cos 2u
               C5=cos u
               C6=-sin 4u
               C7=-sin 2u
               C8=-sin u

R1=XR(2)+XR(9)
R2=XR(2)-XR(9)
S1=XI(2)+XI(9)
S2=XI(2)-XI(9)
R3=XR(3)+XR(8)
R4=XR(3)-XR(8)
S3=XI(3)+XI(8)
S4=XI(3)-XI(8)
R5=XR(4)+XR(7)
T1=C1*(XR(7)-XR(4))
S5=XI(4)+XI(7)
U1=C1*(XI(7)-XI(4))
R7=XR(5)+XR(6)
R8=XR(5)-XR(6)
S7=XI(5)+XI(6)
S8=XI(5)-XI(6)
R9=XR(1)+R5
S9=XI(1)+S5
T2=XR(1)-R5*C2
U2=XI(1)-S5*C2
T3=(R3-R7)*C3
U3=(S3-S7)*C3
T4=(R1-R7)*C4
U4=(S1-S7)*C4
T5=(R1-R3)*C5
U5=(S1-S3)*C5
R10=R1+R3+R7
S10=S1+S3+S7
R1=T2+T3+T5
R3=T2-T3-T4
R7=T2+T4-T5
S1=U2+U3+U5
S3=U2-U3-U4
S7=U2+U4-U5
XR(1)=R9+R10
XI(1)=S9+S10
R5=R9-R10*C2
S5=S9-S10*C2
R6=-(R2-R4+R8)*C1
S6=-(S2-S4+S8)*C1
T3=(R4+R8)*C6
U3=(S4+S8)*C6
T4=(R2-R8)*C7
U4=(S2-S8)*C7
T5=(R2+R4)*C8
U5=(S2+S4)*C8
R2=T1+T3+T5
R4=T1-T3-T4
R8=T1+T4-T5
S2=U1+U3+U5
S4=U1-U3-U4
S8=U1+U4-U5
XR(2)=R1-S2
XR(9)=R1+S2
XI(2)=S1+R2
XI(9)=S1-R2
XR(3)=R3+S4
XR(8)=R3-S4
XI(3)=S3-R4
XI(8)=S3+R4
XR(4)=R5-S6
XR(7)=R5+S6
XI(4)=S5+R6
XI(7)=S5-R6
XR(5)=R7-S8
XR(6)=R7+S8
XI(5)=S7+R8
XI(6)=S7-R8

(8)  N=16,  20 real multiplications, 148 real additions, u=2π/16.

Coefficients:  C1=sin 2u
               C2=sin u
               C3=cos u+sin u
               C4=cos u-sin u
               C5=cos u

R1=XR(1)+XR(9)
R2=XR(1)-XR(9)
S1=XI(1)+XI(9)
S2=XI(1)-XI(9)
R3=XR(2)+XR(10)
R4=XR(2)-XR(10)
S3=XI(2)+XI(10)
S4=XI(2)-XI(10)
R5=XR(3)+XR(11)
R6=XR(3)-XR(11)
S5=XI(3)+XI(11)
S6=XI(3)-XI(11)
R7=XR(4)+XR(12)
R8=XR(4)-XR(12)
S7=XI(4)+XI(12)
S8=XI(4)-XI(12)
R9=XR(5)+XR(13)
R10=XR(5)-XR(13)
S9=XI(5)+XI(13)
S10=XI(5)-XI(13)
R11=XR(6)+XR(14)
R12=XR(6)-XR(14)
S11=XI(6)+XI(14)
S12=XI(6)-XI(14)
R13=XR(7)+XR(15)
R14=XR(7)-XR(15)
S13=XI(7)+XI(15)
S14=XI(7)-XI(15)
R15=XR(8)+XR(16)
R16=XR(8)-XR(16)
S15=XI(8)+XI(16)
S16=XI(8)-XI(16)
T1=R1+R9
T2=R1-R9
U1=S1+S9
U2=S1-S9
T3=R3+R11
T4=R3-R11
U3=S3+S11
U4=S3-S11
T5=R5+R13
T6=R5-R13
U5=S5+S13
U6=S5-S13
T7=R7+R15
T8=R7-R15
U7=S7+S15
U8=S7-S15
T9=C1*(T4+T8)
T10=C1*(T4-T8)
U9=C1*(U4+U8)
U10=C1*(U4-U8)
R1=T1+T5
R3=T1-T5
S1=U1+U5
S3=U1-U5
R5=T3+T7
R7=T3-T7
S5=U3+U7
S7=U3-U7
R9=T2+T10
R11=T2-T10
S9=U2+U10
S11=U2-U10
R13=T6+T9
R15=T6-T9
S13=U6+U9
S15=U6-U9
T1=R4+R16
T2=R4-R16
U1=S4+S16
U2=S4-S16
T3=C1*(R6+R14)
T4=C1*(R6-R14)
U3=C1*(S6+S14)
U4=C1*(S6-S14)
T5=R8+R12
T6=R8-R12
U5=S8+S12
U6=S8-S12
T7=C2*(T2-T6)
T8=C3*T2-T7
T9=C4*T6-T7
T10=R2+T4
T11=R2-T4
R2=T10+T8
R4=T10-T8
R6=T11+T9
R8=T11-T9
U7=C2*(U2-U6)
U8=C3*U2-U7
U9=C4*U6-U7
U10=S2+U4
U11=S2-U4
S2=U10+U8
S4=U10-U8
S6=U11+U9
S8=U11-U9
T7=C5*(T1+T5)
T8=T7-C4*T1
T9=T7-C3*T5
T10=R10+T3
T11=R10-T3
R10=T10+T8
R12=T10-T8
R14=T11+T9
R16=T11-T9
U7=C5*(U1+U5)
U8=U7-C4*U1
U9=U7-C3*U5
U10=S10+U3
U11=S10-U3
S10=U10+U8
S12=U10-U8
S14=U11+U9
S16=U11-U9
XR(1)=R1+R5
XR(9)=R1-R5
XI(1)=S1+S5
XI(9)=S1-S5
XR(2)=R2+S10
XR(16)=R2-S10
XI(2)=S2-R10
XI(16)=S2+R10
XR(3)=R9+S13
XR(15)=R9-S13
XI(3)=S9-R13
XI(15)=S9+R13
XR(4)=R8-S16
XR(14)=R8+S16
XI(4)=S8+R16
XI(14)=S8-R16
XR(5)=R3+S7
XR(13)=R3-S7
XI(5)=S3-R7
XI(13)=S3+R7
XR(6)=R6+S14
XR(12)=R6-S14
XI(6)=S6-R14
XI(12)=S6+R14
XR(7)=R11-S15
XR(11)=R11+S15
XI(7)=S11+R15
XI(11)=S11-R15
XR(8)=R4-S12
XR(10)=R4+S12
XI(8)=S4+R12
XI(10)=S4-R12


Appendix  C.  SWIFT  SHORT  DFT  ALGORITHMS 


The SWIFT short DFT algorithms are given for lengths 3, 5, 7, 9, and 16.
The algorithms for lengths 3 and 5 are from [9], with slight modifications.
In the modified versions shown here, duplicative additions are eliminated.
The SWIFT algorithms for lengths 2, 4, and 8 are identical to the PFA
algorithms for the same lengths and are thus omitted. All the algorithms are
written in terms of real multiplications and additions. In addition, no
trivial multiplications are used in these algorithms. The real and imaginary
parts of the complex input data are represented in natural order by XR(1),
XR(2),..., XR(N) and XI(1), XI(2),..., XI(N), respectively. The complex
output is stored in natural order in the XR and XI input arrays. The values
R1, R2,..., S1, S2,..., U1, U2,..., and T1, T2,... are all temporary values
derived from input data and intermediate results. The total number of real
multiplications and additions for each DFT is listed with each algorithm. The
algorithms listed here have not been optimized with respect to minimizing the
amount of temporary storage required.

(1)  N=3;  4 real multiplications, 12 real additions, u=2π/3.

Coefficients:  C1=-3/2
               C2=sin u

R1=XR(2)+XR(3)

R2=XR(2)-XR(3) 

S1=XI(2)+XI(3) 

S2=XI(2)-XI(3) 

XR(1)=R1+XR(1) 

XI(1)=S1+XI(1)

T1=R1*C1 

T2=R2*C2 

U1=S2*C2 

U2=S1*C1 

T3=XR(1)+T1 

U3=XI(1)+U2 

XR(2)=T3+U1 

XR(3)=T3-U1 

XI(2)=U3-T2 

XI(3)=U3+T2 

(2)  N=5,  16 real multiplications, 32 real additions, u=2π/5.

Coefficients:  C1=cos u-1
               C2=cos 2u-1
               C3=sin u
               C4=sin 2u

R1=XR(2)+XR(5) 

R2=XR(2)-XR(5) 

S1=XI(2)+XI(5) 

S2=XI(2)-XI(5) 

R3=XR(3)+XR(4) 

R4=XR(3)-XR(4) 

S3=XI(3)+XI(4) 


S4=XI(3)-XI(4) 

T1=R1+R3 

U1=S1+S3 

XR(1)=XR(1)+T1 

XI(1)=XI(1)+U1 

T2=XR(1)+(C1*R1)+(C2*R3)

T3=XR(1)+(C2*R1)+(C1*R3) 

T4=(C3*R2)+(C4*R4) 

T5=(C4*R2)-(C3*R4) 

U2=(C3*S2)+(C4*S4) 

U3=(C4*S2)-(C3*S4) 

U4=XI(1)+(C1*S1)+(C2*S3) 

U5=XI(1)+(C2*S1)+(C1*S3) 

XR(2)=T2+U2 

XR(3)=T3+U3 

XR(4)=T3-U3 

XR(5)=T2-U2

XI(2)=U4-T4

XI(3)=U5-T5

XI(4)=U5+T5 

XI(5)=U4+T4 
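
The 5-point SWIFT listing shows why its cosine coefficients are written as
cos u - 1 and cos 2u - 1: XR(1) and XI(1) already hold the sum of all inputs
when T2, T3, U4, and U5 are formed, so the "-1" removes the extra copies of R1,
R3, S1, and S3. The sketch below transcribes the listing under the
assumption u = 2π/5, with zero-based lists standing in for XR(1..5) and
XI(1..5); the names `swift5_inplace` and `direct_dft` are hypothetical.

```python
import cmath
import math

def direct_dft(x):
    """Direct DFT of Equation (1), used only as a reference."""
    n_pts = len(x)
    return [sum(x[n] * cmath.exp(-2j * math.pi * n * k / n_pts)
                for n in range(n_pts))
            for k in range(n_pts)]

def swift5_inplace(xr, xi):
    """5-point DFT in real arithmetic; xr and xi are overwritten with the output."""
    u = 2 * math.pi / 5
    c1, c2 = math.cos(u) - 1, math.cos(2 * u) - 1
    c3, c4 = math.sin(u), math.sin(2 * u)
    r1, r2 = xr[1] + xr[4], xr[1] - xr[4]
    s1, s2 = xi[1] + xi[4], xi[1] - xi[4]
    r3, r4 = xr[2] + xr[3], xr[2] - xr[3]
    s3, s4 = xi[2] + xi[3], xi[2] - xi[3]
    xr[0] += r1 + r3              # XR(1) now holds the sum of all real parts
    xi[0] += s1 + s3              # XI(1) now holds the sum of all imaginary parts
    t2 = xr[0] + c1 * r1 + c2 * r3
    t3 = xr[0] + c2 * r1 + c1 * r3
    t4 = c3 * r2 + c4 * r4
    t5 = c4 * r2 - c3 * r4
    u2 = c3 * s2 + c4 * s4
    u3 = c4 * s2 - c3 * s4
    u4 = xi[0] + c1 * s1 + c2 * s3
    u5 = xi[0] + c2 * s1 + c1 * s3
    xr[1], xr[2], xr[3], xr[4] = t2 + u2, t3 + u3, t3 - u3, t2 - u2
    xi[1], xi[2], xi[3], xi[4] = u4 - t4, u5 - t5, u5 + t5, u4 + t4
```

The sixteen products in T2 through U5 are the 16 real multiplications listed
for this length.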

(3)  N=7,  36 real multiplications, 60 real additions, u=2π/7.

Coefficients:  C1=cos u
               C2=cos 2u
               C3=cos 3u
               C4=sin u
               C5=sin 2u
               C6=sin 3u

R1=XR(2)+XR(7)
R2=XR(2)-XR(7)
S1=XI(2)+XI(7)
S2=XI(2)-XI(7)
R3=XR(3)+XR(6)
R4=XR(3)-XR(6)
S3=XI(3)+XI(6)
S4=XI(3)-XI(6)
R5=XR(4)+XR(5)
R6=XR(4)-XR(5)
S5=XI(4)+XI(5)
S6=XI(4)-XI(5)
T1=R1+R3+R5
U1=S1+S3+S5
XR(1)=XR(1)+T1
XI(1)=XI(1)+U1
T2=XR(1)+(C1*R1)+(C2*R3)+(C3*R5)
T3=XR(1)+(C2*R1)+(C3*R3)+(C1*R5)
T4=XR(1)+(C3*R1)+(C1*R3)+(C2*R5)
T5=(C4*R2)+(C5*R4)+(C6*R6)
T6=(C5*R2)-(C6*R4)-(C4*R6)
T7=(C6*R2)-(C4*R4)+(C5*R6)
U2=(C4*S2)+(C5*S4)+(C6*S6)
U3=(C5*S2)-(C6*S4)-(C4*S6)
U4=(C6*S2)-(C4*S4)+(C5*S6)
U5=XI(1)+(C1*S1)+(C2*S3)+(C3*S5)
U6=XI(1)+(C2*S1)+(C3*S3)+(C1*S5)
U7=XI(1)+(C3*S1)+(C1*S3)+(C2*S5)
XR(2)=T2+U2
XR(3)=T3+U3
XR(4)=T4+U4
XR(5)=T4-U4
XR(6)=T3-U3
XR(7)=T2-U2
XI(2)=U5-T5
XI(3)=U6-T6
XI(4)=U7-T7
XI(5)=U7+T7
XI(6)=U6+T6
XI(7)=U5+T5


(4)  N=9,  44 real multiplications, 88 real additions, u=2π/9.

Coefficients:  C1=cos u
               C2=cos 2u
               C3=cos 3u
               C4=cos 4u
               C5=sin u
               C6=sin 2u
               C7=sin 3u
               C8=sin 4u

R1=XR(2)+XR(9)
R2=XR(2)-XR(9)
S1=XI(2)+XI(9)
S2=XI(2)-XI(9)
R3=XR(3)+XR(8)
R4=XR(3)-XR(8)
S3=XI(3)+XI(8)
S4=XI(3)-XI(8)
R5=XR(4)+XR(7)
R6=XR(4)-XR(7)
S5=XI(4)+XI(7)
S6=XI(4)-XI(7)
R7=XR(5)+XR(6)
R8=XR(5)-XR(6)
S7=XI(5)+XI(6)
S8=XI(5)-XI(6)
T1=R1+R3+R5+R7
U1=S1+S3+S5+S7
XR(1)=XR(1)+T1
XI(1)=XI(1)+U1
T2=(C3*R5)+XR(1)
T3=(C1*R1)+(C2*R3)+(C4*R7)+T2
T4=(C2*R1)+(C4*R3)+(C1*R7)+T2
T5=C3*(T1-R5)+R5+XR(1)
T6=(C4*R1)+(C1*R3)+(C2*R7)+T2
T7=C7*R6
T8=(C5*R2)+(C6*R4)+(C8*R8)+T7
T9=(C6*R2)+(C8*R4)-(C5*R8)-T7
T10=C7*(R2-R4+R8)
T11=(C8*R2)-(C5*R4)-(C6*R8)+T7
U2=C7*S6
U3=(C5*S2)+(C6*S4)+(C8*S8)+U2
U4=(C6*S2)+(C8*S4)-(C5*S8)-U2
U5=C7*(S2-S4+S8)
U6=(C8*S2)-(C5*S4)-(C6*S8)+U2
U7=(C3*S5)+XI(1)
U8=(C1*S1)+(C2*S3)+(C4*S7)+U7
U9=(C2*S1)+(C4*S3)+(C1*S7)+U7
U10=C3*(U1-S5)+S5+XI(1)
U11=(C4*S1)+(C1*S3)+(C2*S7)+U7
XR(2)=T3+U3
XR(3)=T4+U4
XR(4)=T5+U5
XR(5)=T6+U6
XR(6)=T6-U6
XR(7)=T5-U5
XR(8)=T4-U4
XR(9)=T3-U3
XI(2)=U8-T8
XI(3)=U9-T9
XI(4)=U10-T10
XI(5)=U11-T11
XI(6)=U11+T11
XI(7)=U10+T10
XI(8)=U9+T9
XI(9)=U8+T8

(5)  N=16,  24 real multiplications, 144 real additions, u=2π/16.

Coefficients:  C1=cos u
               C2=cos 2u
               C3=cos 3u

R1=XR(1)+XR(9)
R2=XR(1)-XR(9)
S1=XI(1)+XI(9)
S2=XI(1)-XI(9)
R3=XR(2)+XR(16)
R4=XR(2)-XR(16)
S3=XI(2)+XI(16)
S4=XI(2)-XI(16)
R5=XR(3)+XR(15)
R6=XR(3)-XR(15)
S5=XI(3)+XI(15)
S6=XI(3)-XI(15)
R7=XR(4)+XR(14)
R8=XR(4)-XR(14)
S7=XI(4)+XI(14)
S8=XI(4)-XI(14)
R9=XR(5)+XR(13)
R10=XR(5)-XR(13)
S9=XI(5)+XI(13)
S10=XI(5)-XI(13)
R11=XR(6)+XR(12)
R12=XR(6)-XR(12)
S11=XI(6)+XI(12)
S12=XI(6)-XI(12)
R13=XR(7)+XR(11)
R14=XR(7)-XR(11)
S13=XI(7)+XI(11)
S14=XI(7)-XI(11)
R15=XR(8)+XR(10)
R16=XR(8)-XR(10)
S15=XI(8)+XI(10)
S16=XI(8)-XI(10)
T1=R13+R5
T2=R13-R5
T3=R1+R9
T4=R1-R9
T5=T3+T1
T6=T3-T1
T7=C2*T2
T8=R2-T7
T9=R2+T7
T10=R3+R15
T11=R3-R15
T12=R7+R11
T13=R7-R11
T14=T10+T12
T15=(C1*T11)+(C3*T13)
T16=C2*(T10-T12)
T17=(C3*T11)-(C1*T13)
T18=C2*(R6+R14)
T19=R6-R14
T20=T18+R10
T21=T18-R10
T22=R4+R16
T23=R4-R16
T24=R8+R12
T25=R8-R12
T26=(C3*T22)+(C1*T24)
T27=C2*(T23+T25)
T28=(C1*T22)-(C3*T24)
T29=T23-T25
U1=S1+S9
U2=S1-S9
U3=S5+S13
U4=S5-S13
U5=U1+U3
U6=C2*U4
U7=S2+U6
U8=S2-U6
U9=U1-U3
U10=S3+S15
U11=S3-S15
U12=S11+S7
U13=S11-S7
U14=U10+U12
U15=C2*(U10-U12)
U16=(C1*U11)-(C3*U13)
U17=(C3*U11)+(C1*U13)
U18=C2*(S6+S14)
U19=U18+S10
U20=U18-S10
U21=S6-S14
U22=S4+S16
U23=S4-S16
U24=S8+S12
U25=S8-S12
U26=(C3*U22)+(C1*U24)
U27=C2*(U23+U25)
U28=(C1*U22)-(C3*U24)
U29=U23-U25
XR(1)=T5+T14
XI(1)=U5+U14
XR(5)=T6+U29
XI(5)=U9-T29
XR(9)=T5-T14
XI(9)=U5-U14
XR(13)=T6-U29
XI(13)=U9+T29
T30=T8+T15
U30=U26+U19
T31=T8-T15
U31=U26-U19
XR(2)=T30+U30
XR(8)=T31+U31
XR(16)=T30-U30
XR(10)=T31-U31
T32=T4+T16
U32=U27+U21
T33=T4-T16
U33=U27-U21
XR(3)=T32+U32
XR(7)=T33+U33
XR(15)=T32-U32
XR(11)=T33-U33
T34=T9+T17
U34=U28+U20
T35=T9-T17
U35=U28-U20
XR(4)=T34+U34
XR(6)=T35+U35
XR(14)=T34-U34
XR(12)=T35-U35
U36=U7+U16
T36=T20+T26
U37=U7-U16
T37=T20-T26
XI(2)=U36-T36
XI(8)=U37+T37
XI(16)=U36+T36
XI(10)=U37-T37
U38=U2+U15
T38=T19+T27
U39=U2-U15
T39=T19-T27
XI(3)=U38-T38
XI(7)=U39+T39
XI(15)=U38+T38
XI(11)=U39-T39
U40=U8+U17
T40=T21+T28
U41=U8-U17
T41=T21-T28
XI(4)=U40-T40
XI(6)=U41+T41
XI(14)=U40+T40
XI(12)=U41-T41